sp_execute_external_script and Bytecode Version Mismatch

You may run into some trouble with sp_execute_external_script and R when you pass in CHAR and/or VARCHAR "string" data as part of your input data set. Here is an example:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT  ''Dave'' AS FirstName, ''Mason'' AS LastName
 UNION ALL
 SELECT ''Björk'', ''Guðmundsdóttir'';';

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'Returned_Names <- Input_Names',
 @input_data_1 = @QueryIn,
 @input_data_1_name = N'Input_Names',
 @output_data_1_name = N'Returned_Names';


A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 2, Line 2
An external script error occurred: 
During startup - Warning message:
In setJsonDatabasePath(system.file("extdata/capabilities.json",  :
  bytecode version mismatch; using eval
Error in sort.list(y) : invalid input 'Björk' in 'utf8towcs'
Calls: data.frame ... as.data.frame -> as.data.frame.character -> factor -> sort.list

Error in execution.  Check the output for more information.
Error in eval(expr, envir, enclos) : 
  Error in execution.  Check the output for more information.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted

A quick workaround for this issue is to CAST the "string" data as NVARCHAR:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT  CAST(''Dave'' AS NVARCHAR(32)) AS FirstName, CAST(''Mason'' AS NVARCHAR(32)) AS LastName
 UNION ALL
 SELECT CAST(''Björk'' AS NVARCHAR(32)), CAST(''Guðmundsdóttir'' AS NVARCHAR(32));';

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'Returned_Names <- Input_Names',
 @input_data_1 = @QueryIn,
 @input_data_1_name = N'Input_Names',
 @output_data_1_name = N'Returned_Names';

[Image: Dave Mason - sp_execute_external_script]


Personal Experience

During my testing, I've found R handles CHAR and VARCHAR data within the input data set as long as the ASCII codes comprising the data are in the range from 0 to 127. This much is not surprising--those are the character codes of the standard ASCII table. Starting with character code 128, R begins having some trouble. To demonstrate, I'll use this query against the undocumented system table spt_values:

SELECT v.number AS [ASCII Code], 
 CHAR(v.number) AS [Char Value]
FROM master..spt_values v
WHERE v.type = 'p'
AND v.number <= 191

Here is the query above passed as the input data set to sp_execute_external_script:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT v.number, CHAR(v.number)
 FROM master..spt_values v
 WHERE v.type = ''p''
 AND v.number <= 191;';

 EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'OutputDataSet <- InputDataSet',
 @input_data_1 = @QueryIn
WITH RESULT SETS (
 (
  [ASCII Code] INT,
  [Character Value] NCHAR(1)
 )
)

In the output result set, we see character values are interpreted correctly up until ASCII code 128. From there onward, the characters returned are not what we'd expect.

[Image: Dave Mason - sp_execute_external_script - ASCII Codes]

Those of you who were watching closely noticed that I only selected ASCII codes up to 191. Why did I stop there? If the input data set includes any ASCII codes of 192 or above, we get the "bytecode version mismatch" error:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT v.number, CHAR(v.number)
 FROM master..spt_values v
 WHERE v.type = ''p''
 AND v.number <= 192;';

 EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'OutputDataSet <- InputDataSet',
 @input_data_1 = @QueryIn
WITH RESULT SETS (
 (
  [ASCII Code] INT,
  [Character Value] NCHAR(1)
 )
)


Msg 39004, Level 16, State 20, Line 15
A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 2, Line 15
An external script error occurred: 
During startup - Warning message:
In setJsonDatabasePath(system.file("extdata/capabilities.json",  :
  bytecode version mismatch; using eval
Error in sort.list(y) : invalid input 'À' in 'utf8towcs'
Calls: data.frame ... as.data.frame -> as.data.frame.character -> factor -> sort.list

Error in execution.  Check the output for more information.
Error in eval(expr, envir, enclos) : 
  Error in execution.  Check the output for more information.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted


But wait! There's more...

All of the testing examples I've shown thus far have been with R version 3.3.3. This is what got installed when I installed Machine Learning Services with my SQL Server 2017 instance.

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'dfReturn <- as.data.frame(as.character(version["version.string"]))',
 @output_data_1_name = N'dfReturn';


-----------------------------------
R version 3.3.3 (2017-03-06)

(1 row affected)

On another instance of SQL Server 2017, I was able to upgrade the R version to 3.5.2 (this was more troublesome than I expected--there were many hoops to jump through).

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'dfReturn <- as.data.frame(as.character(version["version.string"]))',
 @output_data_1_name = N'dfReturn';


-----------------------------------
R version 3.5.2 (2018-12-20)

(1 row affected)

With the newer version of R, I've not been able to reproduce the "bytecode version mismatch" error. ASCII codes from 128 - 255 return a character, while ASCII codes 256 and above return NULL:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT v.number, CHAR(v.number)
 FROM master..spt_values v
 WHERE v.type = ''p''
 AND v.number <= 2048;';

 EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'OutputDataSet <- InputDataSet',
 @input_data_1 = @QueryIn
WITH RESULT SETS (
 (
  [ASCII Code] INT,
  [Character Value] NCHAR(1)
 )
)

[Image: Dave Mason - sp_execute_external_script - Extended ASCII Codes]
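The same CAST-to-NVARCHAR workaround from earlier applies when the input query reads from real tables rather than literals. Here's a hedged sketch against a hypothetical dbo.Customers table with VARCHAR FirstName/LastName columns:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT CAST(c.FirstName AS NVARCHAR(64)) AS FirstName,
        CAST(c.LastName AS NVARCHAR(64)) AS LastName
 FROM dbo.Customers c;';

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'Returned_Names <- Input_Names',
 @input_data_1 = @QueryIn,
 @input_data_1_name = N'Input_Names',
 @output_data_1_name = N'Returned_Names';

Because the CASTs happen in the @input_data_1 query, R only ever sees Unicode data.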

Mapping types between one system and another can be tricky. Tough choices sometimes have to be made. In closing, I'll simply recommend you cast SQL types CHAR and VARCHAR to NCHAR and NVARCHAR respectively for input data sets to sp_execute_external_script. Happy coding, everyone!


An Alternative to BCP


BCP and OPENROWSET are long-lived SQL Server options for working with data in external files. I've blogged about OPENROWSET, including a recent article showing a way to deal with quoted data. One shortcoming I've never been able to overcome is an inconsistent data file, where the data fields in some rows are enclosed in double quotes, but not in all rows.

Let's demonstrate with BCP. Below is a sample data file I'll attempt to load into a SQL Server table. Note the data fields that are enclosed in double quotes and contain the field terminator , (comma) character within them. For reference, the file is also available on GitHub.

name,phone,DOB,email,SSN,job,company
Stanley Lind,5819626239,2013-07-16,denhj@q.ila,241-63-2894,Chartered loss adjuster,"Mante, Mante and Mante"
Hale Bins,3212496148,1903-02-25,piuylnbstd@gxqshjlek.zfk,206-93-3199,Sports development officer,"Borer, Borer and Borer"
Mikaela O'Keefe,5245629478,1927-09-01,hcpykwmbli@o.qfn,159-33-2513,Museum/gallery conservator,"Jacobs, Jacobs and Jacobs"
Renaldo Kub,3175312457,1918-12-10,ualhdze@xktdvileb.pla,696-91-1436,Comptroller,Jacobs-Jacobs
Keagan Hackett,1374574537,1992-12-01,wicybnlpkr@pgnmy.myq,803-23-7144,Private music teacher,Quitzon LLC
Kiana Cronin m.d.,3297585362,2017-01-24,ctrwj@cpmwtrus.jnx,133-84-8085,Fisheries officer,Witting Ltd
Harlow Durgan,9239627469,1918-04-13,qzfgcrkhno@qijz.que,858-94-5148,"Nurse, mental health",Wilkinson PLC
Sheila Donnelly-Blanda,4657648749,2005-07-19,spmzor@kl.uej,737-58-1638,"Therapist, occupational",Barrows PLC
Mr Colin Turcotte,5634176974,1952-01-18,ksagn@csyarmhd.aev,946-84-5702,Maintenance engineer,Kub-Kub
Lesta Krajcik,6244527243,1921-04-20,b@wmjisa.jec,432-99-2126,Nutritional therapist,White-White

Here is the bcp command:

TITLE "%~n0"
ECHO OFF
COLOR 0E
CLS

bcp tempdb.guest.ImaginaryPeople in "C:\Output\ImaginaryPeople.csv" -S .\SQL2017 -T -c -t, -r "\n" -F 2

pause

And here are the results when we query the newly loaded table. BCP still includes the double quotes, as we see on the first three rows, although the , (comma) character is intact and in the right place within the column/field data. However, lines 7 and 8 are wrong. The "job" data is split across two columns/fields. And there's a field terminator , (comma) character now within the "company" data.

[Image: Dave Mason - BCP]


An Alternative

Now let's look at an alternative to BCP. Using SQL Server Machine Learning Services, we can harness all of the flexibility the R language has to offer (or Python or Java, depending on which version of SQL Server you are using). The code below truncates the table and executes sp_execute_external_script, invoking the R function read.csv().

TRUNCATE TABLE tempdb.guest.ImaginaryPeople;
INSERT INTO tempdb.guest.ImaginaryPeople
EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'  
csvFile <- read.csv(file = "C:\\Output\\ImaginaryPeople.csv", 
                    sep = ",", 
                    header = TRUE,
                    numerals = "no.loss")
csvFile$phone <- as.character(csvFile$phone)',
 @output_data_1_name = N'csvFile';

Here are the query results for the newly loaded table. The double quotes that previously enclosed data columns/fields are gone, all the , (comma) characters are where they should be, and all the field data are in the correct columns.

[Image: Dave Mason - sp_execute_external_script - read.csv]
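As a hedged variation, the same script can also return the file's contents directly to the client without loading a table, by pairing the output data frame with WITH RESULT SETS. The column names and types below are my assumptions based on the sample file's header row:

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'  
csvFile <- read.csv(file = "C:\\Output\\ImaginaryPeople.csv", 
                    sep = ",", 
                    header = TRUE,
                    numerals = "no.loss",
                    stringsAsFactors = FALSE)  # keep strings as character vectors
csvFile$phone <- as.character(csvFile$phone)',
 @output_data_1_name = N'csvFile'
WITH RESULT SETS ((
 name VARCHAR(255),
 phone VARCHAR(16),
 DOB DATE,
 email VARCHAR(255),
 SSN CHAR(11),
 job VARCHAR(255),
 company VARCHAR(255)
));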


Dave's Thoughts

BCP and OPENROWSET have stood the test of time, and rightly so. They're great options. And while you might argue they're imperfect, I will be quick to remind you they are extremely fast. In situations where those tools may not work or are not preferable, there are certainly other options. (SSIS, PowerShell, and C# come to mind.) Maybe R, Python, or Java in-database are other options you'd consider? Comments welcome below. Happy coding!


Pivoting Data Without TSQL


Pivoting data in SQL Server is something that comes up infrequently for me. Whenever the need arises, I have to pause and ask myself "What is it I'm trying to do again?". Next I go to the documentation for the T-SQL PIVOT syntax (which I'll never commit to memory) and realize for the umpteenth time the pivoted values have to be hard coded. Then I ponder using dynamic T-SQL because I won't always know the values to pivot at query design time.

If T-SQL isn't a good hammer to PIVOT's nail, perhaps R is. There are different packages that make summarizing and transposing data frames somewhat easy, and R can even dynamically pivot values that aren't known until runtime. But there is a catch, which I'll get to in a bit.

Here's an [AdventureWorks] query showing purchase orders, including the employee that created each PO, the subtotal, and the shipping method.

SELECT 
 p.FirstName + ' ' + p.LastName AS EmployeeName, 
 h.SubTotal, 
 sm.Name AS ShippingMethod
FROM AdventureWorks.Purchasing.PurchaseOrderHeader h
JOIN AdventureWorks.HumanResources.Employee e
 ON e.BusinessEntityID = h.EmployeeID
JOIN AdventureWorks.Person.Person p
 ON p.BusinessEntityID = h.EmployeeID
JOIN AdventureWorks.Purchasing.ShipMethod sm
 ON sm.ShipMethodID = h.ShipMethodID

[Image: SQL Server AdventureWorks]
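For contrast, here is roughly what the T-SQL PIVOT version looks like--a hedged sketch that hard-codes the five ship method names found in my copy of AdventureWorks (verify them against yours). Any new ship method means editing the IN list, which is exactly the maintenance headache described above:

SELECT *
FROM (
 SELECT 
  p.FirstName + ' ' + p.LastName AS EmployeeName, 
  h.SubTotal, 
  sm.Name AS ShippingMethod
 FROM AdventureWorks.Purchasing.PurchaseOrderHeader h
 JOIN AdventureWorks.Person.Person p
  ON p.BusinessEntityID = h.EmployeeID
 JOIN AdventureWorks.Purchasing.ShipMethod sm
  ON sm.ShipMethodID = h.ShipMethodID
) src
PIVOT (
 SUM(SubTotal)
 FOR ShippingMethod IN ([CARGO TRANSPORT 5], [OVERNIGHT J-FAST], [OVERSEAS - DELUXE], 
  [XRQ - TRUCK GROUND], [ZY - EXPRESS])
) AS pvt;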

While there are likely countless methods and packages that can be used to pivot a data frame in R, I'll focus on the dcast function from the reshape2 package. Here's a small block of code to demonstrate, along with the output:


library(RODBC)
library(reshape2)

dbhandle <- odbcDriverConnect('driver={SQL Server};server=.\\SQL2017;database=AdventureWorks;trusted_connection=true')
res <- sqlQuery(dbhandle, "SELECT 
  p.FirstName + ' ' + p.LastName AS EmployeeName, 
  h.SubTotal, 
  sm.Name AS ShippingMethod
FROM Purchasing.PurchaseOrderHeader h
JOIN HumanResources.Employee e
  ON e.BusinessEntityID = h.EmployeeID
JOIN Person.Person p
  ON p.BusinessEntityID = h.EmployeeID
JOIN Purchasing.ShipMethod sm
  ON sm.ShipMethodID = h.ShipMethodID")
po_sum <- dcast(res, EmployeeName ~ ShippingMethod, value.var = "SubTotal", fun.aggregate = sum)
po_sum

[Image: Dave Mason - Microsoft R Client]

Most of the R code is a database connection and query. The result set is assigned to the res data frame. The pivoting is done with a single line of code that invokes the dcast function. It specifies EmployeeNames as row values, ShippingMethods pivoted as columns, and SubTotals summed. No matter how many different ShippingMethods there are, the dcast function dynamically creates columns for them as needed.

Now, let's see what this looks like using R in-database with sp_execute_external_script:

 DECLARE @QueryIn NVARCHAR(MAX) = N'SELECT 
 p.FirstName + '' '' + p.LastName AS EmployeeName, 
 h.SubTotal, 
 sm.Name AS ShippingMethod
FROM AdventureWorks.Purchasing.PurchaseOrderHeader h
JOIN AdventureWorks.HumanResources.Employee e
 ON e.BusinessEntityID = h.EmployeeID
JOIN AdventureWorks.Person.Person p
 ON p.BusinessEntityID = h.EmployeeID
JOIN AdventureWorks.Purchasing.ShipMethod sm
 ON sm.ShipMethodID = h.ShipMethodID';

 EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'  
library(reshape2)
po_sum <- dcast(purchase_orders, 
  purchase_orders$EmployeeName ~ purchase_orders$ShippingMethod, 
  value.var = "SubTotal", fun.aggregate = sum)',
 @input_data_1 = @QueryIn,
 @input_data_1_name = N'purchase_orders',
 @output_data_1_name = N'po_sum';

[Image: Dave Mason - SSMS - sp_execute_external_script]

We have our same pivoted result set of 12 rows and 6 columns. But something's missing: the column names. And that's the catch. If you don't use WITH RESULT SETS when invoking sp_execute_external_script, you don't get column names. That's especially important here because the pivoted data values would normally be the column names.
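When the pivoted values are known in advance, one workaround is to spell them out in WITH RESULT SETS. Here's a hedged sketch that reuses the @QueryIn variable from above and assumes the same five AdventureWorks ship method names:

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'  
library(reshape2)
po_sum <- dcast(purchase_orders, EmployeeName ~ ShippingMethod, 
  value.var = "SubTotal", fun.aggregate = sum)',
 @input_data_1 = @QueryIn,
 @input_data_1_name = N'purchase_orders',
 @output_data_1_name = N'po_sum'
WITH RESULT SETS ((
 EmployeeName NVARCHAR(101),
 [CARGO TRANSPORT 5] MONEY,
 [OVERNIGHT J-FAST] MONEY,
 [OVERSEAS - DELUXE] MONEY,
 [XRQ - TRUCK GROUND] MONEY,
 [ZY - EXPRESS] MONEY
));

Of course, hard-coding the column names defeats the purpose of a dynamic pivot, which brings us right back to where we started.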

This is one aspect of sp_execute_external_script that vexes me. SQL Server won't infer the column names on its own, even though that metadata is readily available. R has a lot of potential for developers and DBAs, even though we may not be data scientists per se. I'm hoping Microsoft will address this shortcoming in a future Cumulative Update for all versions from SQL 2016 forward.

And maybe you can help! If you would, please upvote this customer feedback item. Thanks for reading...and thanks for voting!


Generating Fake Data in SQL Server With R


I've been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals who aren't data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again, and I wondered: can R be used to obfuscate PII data?

In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I'll focus on the "generator" package. It includes the following functions (whose names should be fairly self-explanatory) for generating fake data:

  • r_credit_card_numbers
  • r_date_of_births
  • r_email_addresses
  • r_full_names
  • r_ipv4_addresses
  • r_latitudes
  • r_longitudes
  • r_national_identification_numbers
  • r_phone_numbers

All of the functions have a required n parameter for the number of observations to generate. For dates of birth and phone numbers, there are some additional optional parameters for customization. Here's an example (using the functions I find most useful) via SQL Server's Machine Learning Services and sp_execute_external_script:

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(generator)
num_observations <- 100
df <- as.data.frame(r_full_names(num_observations))
df$phone <- r_phone_numbers(num_observations, use_hyphens = TRUE)
df$DOB <- as.character(r_date_of_births(num_observations))
df$email <- r_email_addresses(num_observations)
df$SSN <- r_national_identification_numbers(num_observations)
OutputDataSet <- df'
WITH RESULT SETS ((
 FullName VARCHAR(255),
 PhoneNumber VARCHAR(16),
 DOB DATE,
 EmailAddress VARCHAR(255),
 SSN CHAR(11)
));

[Image: R generator package]


Dave's Thoughts

This presents a really interesting option, although there are many questions I haven't broached that are outside the scope of this post. Generating random dates and digits is fairly trivial with T-SQL. Generating random (and meaningful) strings is not. Those fake names are pretty cool, although if you are outside of the US, the diversity of those names (or lack thereof) may be a concern. The email addresses are not realistic in my opinion. They look too random, as you can see by looking at a few of the top-level domain name suffixes. (.lhw? .nxa?)

There probably aren't enough functions in the generator package to roll your own complete PII obfuscation solution. However, it appears to be at version 1.0. The documentation notes functions for home address and birthplace (among others) are "To be added". If those functions arrive any time soon, I'll be considering using "generator" for some non-production environments.

Are there any other R packages you know of for generating fake PII? Leave a message in the comments!


Installing R Packages for SQL Server Machine Learning Services


I've had some issues installing R packages for SQL Server. There have been a scant few packages I've installed here and there. I muddled through each one, never remembering the steps I took the previous time because I never documented anything. No longer. Here I'll show what has worked for me.

It always starts with an error similar to this when trying to run the sp_execute_external_script procedure in SSMS. It's overly verbose, but with patience, you'll see the problem:

Msg 39004, Level 16, State 20, Line 0
A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 2, Line 0
An external script error occurred: 
During startup - Warning message:
In setJsonDatabasePath(system.file("extdata/capabilities.json",  :
  bytecode version mismatch; using eval
Error in library(generator) : there is no package called 'generator'
Calls: source -> withVisible -> eval -> eval -> library

Error in execution.  Check the output for more information.
Error in eval(expr, envir, enclos) : 
  Error in execution.  Check the output for more information.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted


install.packages()

Installing an R package is normally just a one-liner. Here's what it looks like with RStudio:

> install.packages("generator", quiet = TRUE)
package ‘generator’ successfully unpacked and MD5 sums checked

If we try running install.packages() in the SQL Server database engine via sp_execute_external_script, we get an error:

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'install.packages("generator", quiet = TRUE)'

Msg 39004, Level 16, State 20, Line 0
A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 2, Line 0
An external script error occurred: 
During startup - Warning message:
In setJsonDatabasePath(system.file("extdata/capabilities.json",  :
  bytecode version mismatch; using eval
Warning in install.packages("generator", quiet = TRUE) :
  'lib = "C:/Program Files/Microsoft SQL Server/MSSQL14.SQL2017/R_SERVICES/library"' is not writable
Error in install.packages("generator", quiet = TRUE) : 
  unable to install packages
Calls: source -> withVisible -> eval -> eval -> install.packages

Error in execution.  Check the output for more information.
Error in eval(expr, envir, enclos) : 
  Error in execution.  Check the output for more information.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted


R.exe

To install a package, the R.exe utility has been successful for me so far. It's located in the path below, which is specific to the version of SQL Server installed and whether it's a default or named instance:
%programfiles%\Microsoft SQL Server\MSSQLnn.Instance\R_SERVICES\bin

Navigate to that path in Windows Explorer, right-click R.exe, and choose "Run as administrator". This opens a command line interface. I prefer to clear the console first, so I type in and enter shell("cls"). Next, type in install.packages() with the name of the package to install (plus any other desired parameters) and hit ENTER. Here, I'll specify the "generator" package along with the "lib" parameter, which specifies the directory where the package is to be installed. Note that this path is specific to my instance of SQL Server 2017, which is a named instance ("SQL2017"):

install.packages("generator", lib = "C:/Program Files/Microsoft SQL Server/MSSQL14.SQL2017/R_SERVICES/library")

When typing in the R.exe console, the command gets truncated on screen for me--notice the $ character to the far left as the command is entered, and again to the far right after hitting the ENTER key:

[Image: SQL Server - R.exe]

Now back to SSMS. When I try running my R code again, the "there is no package" error is gone. If there are no other errors, it should work now:

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'
library(generator)
OutputDataSet <- as.data.frame(r_latitudes(10))
'
WITH RESULT SETS ((
 latitudes VARCHAR(255)
));

[Image: sp_execute_external_script SSMS]
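If you'd like to confirm which packages the instance's R runtime can actually see--all from SSMS--here's a hedged sketch that lists them:

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'OutputDataSet <- data.frame(PackageName = rownames(installed.packages()), stringsAsFactors = FALSE)'
WITH RESULT SETS ((
 PackageName VARCHAR(255)
));

The newly installed "generator" package should appear in the list.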

OPENROWSET and Quoted Data


OPENROWSET has been one of my favorite tools for "querying" data in files that are external to a SQL Server instance. I've written about it a couple times (here and here). One of the challenges I noted was dealing with source data files that have values enclosed in double quotes.

Let's take a look at an example. If you're not familiar with OPENROWSET, here is the basic usage:

SELECT *
FROM OPENROWSET (
 BULK '[Data_Source_File]',
 FORMATFILE = '[Format_File]'
) AS p

For this post, the data source file will be a Performance Monitor log file with sample data from server "Ozymandias". It's available on GitHub. For the format file, I typically start by creating a table in SQL Server, then running the bcp utility with -f and the format option.

Here is a table definition with columns that correspond to the header row of the Perfmon log file (note I've shortened and tidied up the column names):

CREATE TABLE tempdb.guest.PermonStats (
 PerfMonDate VARCHAR(MAX),
 Total_Read_Latency VARCHAR(MAX),
 Total_Write_Latency VARCHAR(MAX),
 Total_IOPS VARCHAR(MAX),
 C_Read_Latency VARCHAR(MAX),
 C_Write_Latency VARCHAR(MAX),
 C_IOPS VARCHAR(MAX),
 D_Read_Latency VARCHAR(MAX),
 D_Write_Latency VARCHAR(MAX),
 D_IOPS VARCHAR(MAX),
 E_Read_Latency VARCHAR(MAX),
 E_Write_Latency VARCHAR(MAX),
 E_IOPS VARCHAR(MAX),
 F_Read_Latency VARCHAR(MAX),
 F_Write_Latency VARCHAR(MAX),
 F_IOPS VARCHAR(MAX),
 G_Read_Latency VARCHAR(MAX),
 G_Write_Latency VARCHAR(MAX),
 G_IOPS VARCHAR(MAX),
 H_Read_Latency VARCHAR(MAX),
 H_Write_Latency VARCHAR(MAX),
 H_IOPS VARCHAR(MAX),
 I_Read_Latency VARCHAR(MAX),
 I_Write_Latency VARCHAR(MAX),
 I_IOPS VARCHAR(MAX),
 J_Read_Latency VARCHAR(MAX),
 J_Write_Latency VARCHAR(MAX),
 J_IOPS VARCHAR(MAX),
 Available_MBytes VARCHAR(MAX),
 Pct_Processor_Time VARCHAR(MAX),
 Processor_Queue_Length VARCHAR(MAX),
 Buffer_cache_hit_ratio VARCHAR(MAX),
 Page_life_expectancy VARCHAR(MAX),
 Page_lookups_per_sec VARCHAR(MAX),
 Page_reads_per_sec VARCHAR(MAX),
 Tansactions_per_sec VARCHAR(MAX),
 Active_Temp_Tables VARCHAR(MAX),
 User_Connections VARCHAR(MAX),
 Target_Server_Memory_KB VARCHAR(MAX),
 Total_Server_Memory_KB VARCHAR(MAX),
 Batch_Requests_per_sec VARCHAR(MAX),
 SQL_Compilations_per_sec VARCHAR(MAX),
 SQL_ReCompilations_per_sec VARCHAR(MAX)
)

Next, I run bcp to generate my XML format file. For reference, the format file (PerfmonStats.ver1.xml) is available on GitHub (we'll modify this file in a bit).

bcp tempdb.guest.PermonStats format nul -c -x -f PerfmonStats.ver1.xml -t, -S .\SQL2017 -T

Ok, now that we've got our data file and format file squared away, let's see what we get with OPENROWSET. Below is the query, followed by the SSMS output.

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver1.xml',
 FIRSTROW = 2 --skip header row
) AS p

[Image: SQL Server OPENROWSET]

I haven't shown all the columns, but you get the idea--every column in the result set has data enclosed in double quotes. That's exactly how it appears in the source data file.


Field Terminators

Let's go back to the XML format file. The <RECORD> element describes the data as it is stored in the data file. It contains a set of one or more <FIELD> elements. These elements correspond to fields in the data file. Note that our field terminators are commas, as shown below:

<RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    ...
    <FIELD ID="41" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="42" xsi:type="CharTerm" TERMINATOR="," COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="43" xsi:type="CharTerm" TERMINATOR="\r\n" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>

We can specify different TERMINATOR values, though. Instead of a comma (,), let's change it to double-quote, comma, double-quote (","). The XML format file already uses double-quotes for its attribute string literals, but we can use single-quotes instead, which lets the terminator itself contain a double-quote. With that in mind, let's change the XML format file to this (available on GitHub as ver2):

<RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    ...
    <FIELD ID="41" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="42" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="43" xsi:type="CharTerm" TERMINATOR="\r\n" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>

If we rerun our OPENROWSET query with version 2 of the XML format file, we've eliminated all of the double-quotes, except for two: an opening double-quote in the first field/column, and a closing double-quote in the last field/column. (The image has been altered to display both the first and last columns of the result set in the SSMS grid.)

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver2.xml',
 FIRSTROW = 2 --skip header row
) AS p

[Image: SQL Server OPENROWSET]


Last Field Terminator

Getting rid of the double-quote from the last column is easy enough. Again, we'll change a TERMINATOR, but this time just for the final FIELD (ID="43"). The old value is "\r\n" (\r is a carriage return character and \n is a newline character). The new value will be '"\r\n'. This variant of the XML format file is available on GitHub as ver3:

<RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    ...
    <FIELD ID="41" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="42" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="43" xsi:type="CharTerm" TERMINATOR='"\r\n' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>

Here's our OPENROWSET query with version 3 of the XML format file. (Once again, the image has been altered to display both the first and last columns of the result set in the SSMS grid.)

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver3.xml',
 FIRSTROW = 2 --skip header row
) AS p

[Image: Dave Mason - SQL Server OPENROWSET]


First Field

The remaining double-quote has always been problematic. I've not been able to find a way to "remove" it as we've done with the others. So instead, I'll "hide" it. In the format file, I'll need to insert a new <FIELD> at the first position (ID="1"). That means I'd have to renumber all the remaining fields, incrementing the ID by one. In the <ROW> element, I'd also have to insert a new <COLUMN> at the first position (SOURCE="1") and renumber all the remaining columns, incrementing the SOURCE by one. That's a lot of manual editing. Instead, I'll let bcp do most of the work for me. Drop the tempdb.guest.PermonStats table and recreate it with a new column in the first ordinal position:

DROP TABLE tempdb.guest.PermonStats 
GO

CREATE TABLE tempdb.guest.PermonStats (
 Quote CHAR(1),
 PerfMonDate VARCHAR(MAX),
 Total_Read_Latency VARCHAR(MAX),
 Total_Write_Latency VARCHAR(MAX),
 Total_IOPS VARCHAR(MAX),
 C_Read_Latency VARCHAR(MAX),
 C_Write_Latency VARCHAR(MAX),
 C_IOPS VARCHAR(MAX),
 D_Read_Latency VARCHAR(MAX),
 D_Write_Latency VARCHAR(MAX),
 D_IOPS VARCHAR(MAX),
 E_Read_Latency VARCHAR(MAX),
 E_Write_Latency VARCHAR(MAX),
 E_IOPS VARCHAR(MAX),
 F_Read_Latency VARCHAR(MAX),
 F_Write_Latency VARCHAR(MAX),
 F_IOPS VARCHAR(MAX),
 G_Read_Latency VARCHAR(MAX),
 G_Write_Latency VARCHAR(MAX),
 G_IOPS VARCHAR(MAX),
 H_Read_Latency VARCHAR(MAX),
 H_Write_Latency VARCHAR(MAX),
 H_IOPS VARCHAR(MAX),
 I_Read_Latency VARCHAR(MAX),
 I_Write_Latency VARCHAR(MAX),
 I_IOPS VARCHAR(MAX),
 J_Read_Latency VARCHAR(MAX),
 J_Write_Latency VARCHAR(MAX),
 J_IOPS VARCHAR(MAX),
 Available_MBytes VARCHAR(MAX),
 Pct_Processor_Time VARCHAR(MAX),
 Processor_Queue_Length VARCHAR(MAX),
 Buffer_cache_hit_ratio VARCHAR(MAX),
 Page_life_expectancy VARCHAR(MAX),
 Page_lookups_per_sec VARCHAR(MAX),
 Page_reads_per_sec VARCHAR(MAX),
 Tansactions_per_sec VARCHAR(MAX),
 Active_Temp_Tables VARCHAR(MAX),
 User_Connections VARCHAR(MAX),
 Target_Server_Memory_KB VARCHAR(MAX),
 Total_Server_Memory_KB VARCHAR(MAX),
 Batch_Requests_per_sec VARCHAR(MAX),
 SQL_Compilations_per_sec VARCHAR(MAX),
 SQL_ReCompilations_per_sec VARCHAR(MAX)
)

I run bcp again to generate the fourth version of the format file:

bcp tempdb.guest.PermonStats format nul -c -x -f PerfmonStats.ver4.xml -t, -S .\SQL2017 -T

Once again, the TERMINATOR of every FIELD needs to be changed. Field ID=1 corresponds to the new "Quote" column we added to the SQL table. Change its terminator from a comma "," to a double-quote '"'. As was done in the previous examples, change the terminators for the remaining fields (except for the last one) from "," to '","', and the terminator for the last field from "\r\n" to '"\r\n'.

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
      <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='"' MAX_LENGTH="1" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      ...
      <FIELD ID="42" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      <FIELD ID="43" xsi:type="CharTerm" TERMINATOR='","' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
      <FIELD ID="44" xsi:type="CharTerm" TERMINATOR='"\r\n' COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>

Here's our OPENROWSET query with version 4 of the XML format file (available on GitHub as ver4):

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver4.xml',
 FIRSTROW = 2 --skip header row
) AS p

[Image: Dave Mason - SQL Server OPENROWSET]

We're almost there. The final double-quote character is gone. But we have a pesky "Quote" column in the mix. We could omit it in the result set by specifying all of the remaining columns in our SELECT statement. But that's a lot of typing. And I'm lazy. Best practices be damned, I want to use a SELECT * query!


Hiding A Column

Let's make one more trip back to the format file. Scroll down to the ROW element and comment out the first COLUMN element (SOURCE="1"):

 <ROW>
    <!--COLUMN SOURCE="1" NAME="Quote" xsi:type="SQLCHAR"/-->
    <COLUMN SOURCE="2" NAME="PerfMonDate" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="Total_Read_Latency" xsi:type="SQLVARYCHAR"/>
    ...
    <COLUMN SOURCE="42" NAME="Batch_Requests_per_sec" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="43" NAME="SQL_Compilations_per_sec" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="44" NAME="SQL_ReCompilations_per_sec" xsi:type="SQLVARYCHAR"/>
</ROW>

Here's our final OPENROWSET query with version 5 of the XML format file (available on GitHub as ver5):

SELECT *
FROM OPENROWSET (
 BULK 'C:\PerfLogs\SQLServerPerf_000001.csv',
 FORMATFILE = 'C:\PerfLogs\PerfmonStats.ver5.xml',
 FIRSTROW = 2 --skip header row
) AS p

[Image: Dave Mason - SQL Server OPENROWSET]

All the double quote characters are gone, the "Quote" column is hidden, and we have the convenience of using SELECT * in our OPENROWSET query.


Schema-on-Read for SQL Server?


I've grumbled that SQL Server doesn't natively provide a way to infer JSON schema in a query. It was pointed out to me that it's "schema on read or schema on write" when dealing with JSON data in SQL Server. I'd never heard that phrase before, and it certainly spelled things out in a clear and concise way.

Nonetheless, I had also concluded dynamic T-SQL could be used for the OPENJSON function...and that it would be icky. And here we are. Yes, I went down that path. And yes, it was sorta icky. Let's take a look.

In my estimation, I would need to use one of two options along with the OPENJSON function:

  1. Explicitly specify schema using the WITH clause to format the output.
  2. Define schema in the SELECT list.

Here are examples of each:

DECLARE @JsonData NVARCHAR(MAX) = '[
  {
    "ContactTypeID": 1,
    "Name": "Accounting Manager",
    "ModifiedDate": "2002-06-01T00:00:00"
  },
  {
    "ContactTypeID": 2,
    "Name": "Assistant Sales Agent",
    "ModifiedDate": "2002-06-01T00:00:00"
  },
  {
    "ContactTypeID": 3,
    "Name": "Assistant Sales Representative",
    "ModifiedDate": "2002-13-01T00:00:00"
  }
]'

--Option 1
SELECT *
FROM OPENJSON(@JsonData) 
WITH (
 ContactTypeID INT,
 Name NVARCHAR(50),
 ModifiedDate DATETIME
)

--Option 2
SELECT 
 TRY_CAST(JSON_VALUE(j.value, '$.ContactTypeID') AS INT) AS ContactTypeID,
 TRY_CAST(JSON_VALUE(j.value, '$.Name') AS NVARCHAR(50)) AS Name,
 TRY_CAST(JSON_VALUE(j.value, '$.ModifiedDate') AS DATETIME) AS ModifiedDate
FROM OPENJSON(@JsonData) j

Option #1 results in an error due to the invalid date (the "2002-13-01" value in the third JSON object):

Msg 242, Level 16, State 3, Line 19
The conversion of a nvarchar data type to a datetime data type resulted in an out-of-range value.

On the other hand, Option #2 is more forgiving. The use of TRY_CAST does not cause any errors when invalid data is encountered (NULL is returned instead), and all data "rows" are in the result set.

[Image: Dave Mason - OPENJSON - SQL Server]
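To see TRY_CAST's behavior in isolation, here's a quick sketch using the deliberately invalid date from the sample JSON--the bad value comes back as NULL instead of raising an error:

SELECT 
 TRY_CAST('2002-13-01T00:00:00' AS DATETIME) AS InvalidDate,  --returns NULL
 TRY_CAST('2002-06-01T00:00:00' AS DATETIME) AS ValidDate;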

OpenJsonTabular

The code for my stored procedure is below. It has a single input parameter for a string of JSON data. The name/value pairs of the first "row" of JSON data are parsed to obtain column names and types for the result set. A query string for the OPENJSON function is constructed and executed via EXEC. JSON's support for data types is pretty sparse compared to SQL Server's (notice the big CASE expression where I attempt to do some mapping). STRING_AGG isn't available until SQL Server 2017, so if you're using SQL 2016, you'll have to make an edit for it.

USE tempdb;
GO

CREATE OR ALTER PROCEDURE dbo.OpenJsonTabular
 @Json NVARCHAR(MAX)
AS
/*
 Dynamically returns a resultset from the input JSON data.
 The JSON data is assumed to be tabular/flat, with schema
 inferred from the first "row" of JSON key/value pairs.
 The JSON data is assumed to be in the format produced by the
  FOR JSON clause with the AUTO option:

 [   
  { "Row1Column1Name": "Row1Column1Value", "Row1Column2Name": "Row1Column2Value", ... "Row1Column(n)Name": "Row1Column(n)Value" },   
  { "Row2Column1Name": "Row2Column1Value", "Row2Column2Name": "Row2Column2Value", ... "Row2Column(n)Name": "Row2Column(n)Value" },   
  ...
  { "Row(n)Column1Name": "Row(n)Column1Value", "Row(n)Column2Name": "Row(n)Column2Value", ... "Row(n)Column(n)Name": "RowColumn(n)Value" },   
 ]

*/
BEGIN
 DECLARE @Tsql NVARCHAR(MAX) = CHAR(9);

 SELECT @Tsql = @Tsql + STRING_AGG(
  'TRY_CAST(JSON_VALUE(j.value, ''$."' + CAST(k.[key] AS VARCHAR(MAX)) + '"'') AS ' + 
  
  --Try to map the JSON type to a SQL Server type.
  CASE
   --JSON null
   WHEN k.type = 0 THEN 'VARCHAR(MAX)'

   --JSON string (double-quoted Unicode with backslash escaping)
   WHEN k.type = 1 THEN 
    CASE 
     WHEN TRY_CAST(k.[value] AS DATETIME) IS NOT NULL THEN 'DATETIME' 
     ELSE 'VARCHAR(MAX)' 
    END

   --JSON number (double-precision floating-point format in JavaScript)
   WHEN k.type = 2 THEN 
    CASE
     WHEN k.[value] LIKE '%.%' THEN 'NUMERIC(38, 5)'
     WHEN k.[value] LIKE '%,%' THEN 'NUMERIC(38, 5)'
     ELSE 'BIGINT'
    END

   --JSON boolean ("true" or "false")
   WHEN k.type = 3 THEN 'BIT'
   
   --JSON array (ordered sequence of values)
   WHEN k.type = 4 THEN 'VARCHAR(MAX)'

   --JSON object (an unordered collection of key:value pairs)
   WHEN k.type = 5 THEN 'VARCHAR(MAX)'

   ELSE 'VARCHAR(MAX)'  --null
  END + ') AS ' + QUOTENAME(k.[key]), ', ' + CHAR(13) + CHAR(10) + CHAR(9) )

 FROM OPENJSON(@Json) j
 CROSS APPLY OPENJSON(j.value) k
 WHERE j.[key] = 0

 SELECT @Tsql = 'SELECT ' + CHAR(13) + CHAR(10) +
  @Tsql + CHAR(13) + CHAR(10) +
 'FROM OPENJSON(''' + @Json + ''') j';

 --SELECT @Tsql;
 EXEC (@Tsql);
END
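For SQL Server 2016, which predates STRING_AGG, the comma-separated list can be built with the older FOR XML PATH + STUFF pattern instead. Here's a hedged sketch of the substitution, shown with a trivial QUOTENAME expression standing in for the full TRY_CAST/CASE construction used in the procedure:

DECLARE @Json NVARCHAR(MAX) = N'[{"ContactTypeID":1,"Name":"Accounting Manager"}]';
DECLARE @Sep NVARCHAR(10) = N', ' + CHAR(13) + CHAR(10) + CHAR(9);
DECLARE @Tsql NVARCHAR(MAX);

SELECT @Tsql = STUFF((
  SELECT @Sep + QUOTENAME(k.[key])  --the full TRY_CAST(...)/CASE expression would go here
  FROM OPENJSON(@Json) j
  CROSS APPLY OPENJSON(j.value) k
  WHERE j.[key] = 0
  FOR XML PATH(''), TYPE
 ).value('.', 'NVARCHAR(MAX)'), 1, LEN(@Sep), '');

SELECT @Tsql;  --[ContactTypeID], [Name] (each on its own line)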

Examples

Here are a couple of code samples that show how it works.

DECLARE @Json NVARCHAR(MAX) =
(
 SELECT TOP(5) * 
 FROM AdventureWorks.Production.ProductListPriceHistory
 FOR JSON AUTO
);

EXEC tempdb.dbo.OpenJsonTabular @Json;

SET @Json = 
(
 SELECT TOP(5) *
 FROM msdb.sys.objects
 FOR JSON AUTO
)

EXEC tempdb.dbo.OpenJsonTabular @Json;

As you may have expected, the result sets from each execution are different:

[Image: Dave Mason - JSON - OpenJsonTabular - SQL Server]

I don't know if any of this could be considered "schema on read". Maybe it's automated schema on read. Or maybe it's just an abomination. It may be icky, but I still had fun working on it. Converting relational data to JSON and then back to a tabular result set isn't terribly useful, of course. But I have at least one use case that involves serializing external data to JSON and then "querying" it with the OpenJsonTabular procedure. Perhaps more on this in another post.