OPENROWSET and Quoted Data

2020-01-19 0 Comments

Dave Mason - SQL Server OPENROWSET

OPENROWSET has been one of my favorite tools for "querying" data in files that are external to a SQL Server instance. I've written about it a couple times (here and here). One of the challenges I noted was dealing with source data files that have values enclosed in double quotes.

Let's take a look at an example. If you're not familiar with OPENROWSET, here is the basic usage:

SELECT *
FROM OPENROWSET (
 BULK '[Data_Source_File]',
 FORMATFILE = '[Format_File]'
) AS p

For this post, the data source file will be a Performance Monitor log file with sample data from server "Ozymandias". It's available on GitHub. For the format file, I typically start by creating a table in SQL Server, then running the bcp utility with -f and the format option.

Here is a table definition with columns that correspond to the header row of the Perfmon log file (note I've shortened and tidied up the column names):

CREATE TABLE tempdb.guest.PermonStats (
 PerfMonDate VARCHAR(MAX),
 Total_Read_Latency VARCHAR(MAX),
 Total_Write_Latency VARCHAR(MAX),
 Total_IOPS VARCHAR(MAX),
 C_Read_Latency VARCHAR(MAX),
 C_Write_Latency VARCHAR(MAX),
 C_IOPS VARCHAR(MAX),
 D_Read_Latency VARCHAR(MAX),
 D_Write_Latency VARCHAR(MAX),
 D_IOPS VARCHAR(MAX),
 E_Read_Latency VARCHAR(MAX),
 E_Write_Latency VARCHAR(MAX),
 E_IOPS VARCHAR(MAX),
 F_Read_Latency VARCHAR(MAX),
 F_Write_Latency VARCHAR(MAX),
 F_IOPS VARCHAR(MAX),
 G_Read_Latency VARCHAR(MAX),
 G_Write_Latency VARCHAR(MAX),
 G_IOPS VARCHAR(MAX),
 H_Read_Latency VARCHAR(MAX),
 H_Write_Latency VARCHAR(MAX),
 H_IOPS VARCHAR(MAX),
 I_Read_Latency VARCHAR(MAX),
 I_Write_Latency VARCHAR(MAX),
 I_IOPS VARCHAR(MAX),
 J_Read_Latency VARCHAR(MAX),
 J_Write_Latency VARCHAR(MAX),
 J_IOPS VARCHAR(MAX),
 Available_MBytes VARCHAR(MAX),
 Pct_Processor_Time VARCHAR(MAX),
 Processor_Queue_Length VARCHAR(MAX),
 Buffer_cache_hit_ratio VARCHAR(MAX),
 Page_life_expectancy VARCHAR(MAX),
 Page_lookups_per_sec VARCHAR(MAX),
 Page_reads_per_sec VARCHAR(MAX),
 Tansactions_per_sec VARCHAR(MAX),
 Active_Temp_Tables VARCHAR(MAX),
 User_Connections VARCHAR(MAX),
 Target_Server_Memory_KB VARCHAR(MAX),
 Total_Server_Memory_KB VARCHAR(MAX),
 Batch_Requests_per_sec VARCHAR(MAX),
 SQL_Compilations_per_sec VARCHAR(MAX),
 SQL_ReCompilations_per_sec VARCHAR(MAX)
)

Next, I run bcp to generate my XML format file. For reference, the format file (PerfmonStats.ver1.xml) is available on GitHub (we'll modify this file in a bit).

bcp tempdb.guest.PermonStats format nul -c -x -f PerfmonStats.ver1.xml -t, -S .\SQL2017 -T

Ok, now that we've got our data file and format file squared away, let's see what we get with OPENROWSET. Below is the query, followed by the SSMS output.

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver1.xml',
 FIRSTROW = 2 --skip header row
) AS p
SQL Server OPENROWSET

I haven't shown all the columns, but you get the idea--every column in the result set has data enclosed in double quotes. That's exactly how it appears in the source data file.


Field Terminators

Let's go back to the XML format file. The <RECORD> element describes the data as it is stored in the data file. It contains a set of one or more <FIELD> elements. These elements correspond to fields in the data file. Note that our field terminators are commas, as highlighted below:

<RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    ...
    <FIELD ID="41" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="42" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="43" xsi:type="CharTerm" TERMINATOR="\r\n" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>

We can specify different TERMINATOR values, though. Instead of a comma ,, let's change it to double-qoute, comma, double-quote ",". We are already using double-quotes for string literals within the XML format file. But we can use single-quotes instead. With that in mind, let's change the XML format file to this (available on GitHub as ver2):

<RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    ...
    <FIELD ID="41" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="42" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="43" xsi:type="CharTerm" TERMINATOR="\r\n" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>

If we rerun our OPENROWSET query with version 2 of the XML format file, we've eliminated all of the double-quotes, except for two: an opening double-quote in the first field/column, and a closing double-quote in the last field/column. (Image as been altered to display both the first and last columns of the result set in the SSMS grid.)

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver2.xml',
 FIRSTROW = 2 --skip header row
) AS p
SQL Server OPENROWSET


Last Field Terminator

Getting rid of the double-quote from the last column is easy enough. Again, we'll change a TERMINATOR, but this time just for the final FIELD (ID="43"). The old value is "\r\n" (\r is a carriage return character and \n is a newline character). The new value will be '"\r\n'. This variant of the XML format file is available on GitHub as ver3:

<RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    ...
    <FIELD ID="41" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="42" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="43" xsi:type="CharTerm" TERMINATOR='"\r\n' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>

Here's our OPENROWSET query with version 3 of the XML format file. (Once again, the image as been altered to display both the first and last columns of the result set in the SSMS grid.)

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver3.xml',
 FIRSTROW = 2 --skip header row
) AS p
Dave Mason - SQL Server OPENROWSET


First Field

The remaining double-quote has always been problematic. I've not been able to find a way to "remove" it as we've done with the others. So instead, I'll "hide" it. In the format file, I'll need to insert a new <FIELD> at the first position (ID="1"). That means I'd have to renumber all the remaining fields, incrementing the ID by one. In the <ROW> element, I'd also have to insert a new <COLUMN> at the first position (SOURCE="1") and renumber all the remaining columns, incrementing the SOURCE by one. That's a lot of manual editing. Instead, I'll let bcp do most of the work for me. Drop table tempdb.guest.PermonStats and recreate it with a new column in the first ordinal position:

DROP TABLE tempdb.guest.PermonStats 
GO

CREATE TABLE tempdb.guest.PermonStats (
 Quote CHAR(1),
 PerfMonDate VARCHAR(MAX),
 Total_Read_Latency VARCHAR(MAX),
 Total_Write_Latency VARCHAR(MAX),
 Total_IOPS VARCHAR(MAX),
 C_Read_Latency VARCHAR(MAX),
 C_Write_Latency VARCHAR(MAX),
 C_IOPS VARCHAR(MAX),
 D_Read_Latency VARCHAR(MAX),
 D_Write_Latency VARCHAR(MAX),
 D_IOPS VARCHAR(MAX),
 E_Read_Latency VARCHAR(MAX),
 E_Write_Latency VARCHAR(MAX),
 E_IOPS VARCHAR(MAX),
 F_Read_Latency VARCHAR(MAX),
 F_Write_Latency VARCHAR(MAX),
 F_IOPS VARCHAR(MAX),
 G_Read_Latency VARCHAR(MAX),
 G_Write_Latency VARCHAR(MAX),
 G_IOPS VARCHAR(MAX),
 H_Read_Latency VARCHAR(MAX),
 H_Write_Latency VARCHAR(MAX),
 H_IOPS VARCHAR(MAX),
 I_Read_Latency VARCHAR(MAX),
 I_Write_Latency VARCHAR(MAX),
 I_IOPS VARCHAR(MAX),
 J_Read_Latency VARCHAR(MAX),
 J_Write_Latency VARCHAR(MAX),
 J_IOPS VARCHAR(MAX),
 Available_MBytes VARCHAR(MAX),
 Pct_Processor_Time VARCHAR(MAX),
 Processor_Queue_Length VARCHAR(MAX),
 Buffer_cache_hit_ratio VARCHAR(MAX),
 Page_life_expectancy VARCHAR(MAX),
 Page_lookups_per_sec VARCHAR(MAX),
 Page_reads_per_sec VARCHAR(MAX),
 Tansactions_per_sec VARCHAR(MAX),
 Active_Temp_Tables VARCHAR(MAX),
 User_Connections VARCHAR(MAX),
 Target_Server_Memory_KB VARCHAR(MAX),
 Total_Server_Memory_KB VARCHAR(MAX),
 Batch_Requests_per_sec VARCHAR(MAX),
 SQL_Compilations_per_sec VARCHAR(MAX),
 SQL_ReCompilations_per_sec VARCHAR(MAX)
)

I run bcp again to generate the fourth version of the format file:

bcp tempdb.guest.PermonStats format nul -c -x -f PerfmonStats.ver4.xml -t, -S .\SQL2017 -T

Once again the TERMINATOR of every FIELD needs to be changed. Field ID=1 corresponds to the new "Quote" column we added to the SQL table. Change its terminator from a comma "," to a double-quote '"'. As was done in the previous examples, change the terminators for the remaining fields (except for the last one) from "," to '","' and the terminator for the last field from "\r\n" to '"\r\n'

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
      <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='"' MAX_LENGTH="1" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      ...
      <FIELD ID="42" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      <FIELD ID="43" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      <FIELD ID="44" xsi:type="CharTerm" TERMINATOR='"\r\n' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>

Here's our OPENROWSET query with version 4 of the XML format file (available on GitHub as ver4):

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver4.xml',
 FIRSTROW = 2 --skip header row
) AS p
Dave Mason - SQL Server OPENROWSET

We're almost there. The final double-quote character is gone. But we have a pesky "Quote" column in the mix. We could omit it in the result set by specifying all of the remaining columns in our SELECT statement. But that's a lot of typing. And I'm lazy. Best practices be damned, I want to use a SELECT * query!


Hiding A Column

Let's make one more trip back to the format file. Scroll down to the ROW element and comment out the first COLUMN element (SOURCE="1"):

 <ROW>
    <!--COLUMN SOURCE="1" NAME="Quote" xsi:type="SQLCHAR"/-->
    <COLUMN SOURCE="2" NAME="PerfMonDate" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="Total_Read_Latency" xsi:type="SQLVARYCHAR"/>
    ...
    <COLUMN SOURCE="42" NAME="Batch_Requests_per_sec" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="43" NAME="SQL_Compilations_per_sec" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="44" NAME="SQL_ReCompilations_per_sec" xsi:type="SQLVARYCHAR"/>
</ROW>

Here's our final OPENROWSET query with version 5 of the XML format file (available on GitHub as ver5):

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver5.xml',
 FIRSTROW = 2 --skip header row
) AS p
Dave Mason - SQL Server OPENROWSET

All the double quote characters are gone, the "Quote" column is hidden, and we have the convenience of using SELECT * in our OPENROWSET query.

0 comments: