Columnstore For Real-Time Operational Analytics

SQL Server Columnstore - Dave Mason

Starting with SQL Server 2016, new columnstore indexing features allow something called real-time operational analytics. Microsoft's documentation describes it as "the ability to run both analytics and OLTP workloads on the same database tables at the same time". It can also "eliminate the need for ETL and a data warehouse".

I tend to work mostly with OLTP environments. Many of them have questionable designs or serve reporting workloads. Not surprisingly, there are a lot of performance-sapping table scans and index scans. I've compensated for this somewhat by using row and page compression, which became available in all editions of SQL Server starting with SQL Server 2016 SP1. Could I get even better results with columnstore indexes? Let's look at one example.
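
For reference, enabling page compression on an existing rowstore table and index looks something like this (the object names here are hypothetical, not the actual tables from this example):

--Hypothetical object names: rebuild a table and one of its indexes with page compression.
ALTER TABLE dbo.SalesOrderDetail REBUILD WITH (DATA_COMPRESSION = PAGE);
ALTER INDEX IX_SalesOrderDetail_ProductID ON dbo.SalesOrderDetail REBUILD WITH (DATA_COMPRESSION = PAGE);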

Here are four individual query statements from a stored procedure used to get data for a dashboard. If you add up percentages for Estimated Cost (CPU + IO), Estimated CPU Cost, or Estimated IO Cost, you get a total of about 90% (give or take a few percent).

SQL Server Columnstore - Dave Mason

The first two statements include a nonclustered index scan, and the second two both have a clustered index scan. Each of the four scans consumes a large portion of the CPU and/or IO. Two nonclustered columnstore indexes were added. Now let's take a look at the before and after images of the query plans.
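
(The index definitions themselves aren't shown here. For reference, a nonclustered columnstore index is created along these lines -- the table and column names below are hypothetical:)

--Hypothetical table and columns: a nonclustered columnstore index covering the columns the dashboard queries scan.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_OrderDetail
ON dbo.OrderDetail (OrderDate, CustomerID, ProductID, Quantity, LineTotal);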


Statement 1

Index scan and index seek operators are both replaced by columnstore index scans. Note the table insert operation--presumably, its actual cost is unchanged. Yet as a percentage it goes up significantly, becoming more "expensive" by virtue of the other operations becoming more efficient and less "expensive".

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


Statement 2

As in the previous statement, index scan and index seek operators are both replaced by columnstore index scans. We also see the table insert operation increase in cost.

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


Statements 3 and 4

The final two statements are nearly identical, so I won't repeat them. It's a different table this time, and it has a clustered index seek and a clustered index scan. Both operations are replaced by columnstore index scans.

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


With the nonclustered columnstore indexes in place, the relative costs of the four individual query statements go down significantly. Summed up, Estimated Cost (CPU + IO) is down to 58%, Estimated CPU Cost is down to 66%, and Estimated IO Cost is down to 55%.

SQL Server Columnstore - Dave Mason


The execution stats look even better. Elapsed time for the stored procedure drops from 15 seconds to 5 seconds. Logical reads drop by 80% as well. And there's even more good news. By keeping a close eye on index usage stats, I noticed some of the other rowstore indexes on the two tables were no longer being used for index scans. Various other execution plans were now using the new nonclustered columnstore index for those scans instead. Many of those plans saw similar performance gains with reduced CPU and IO.
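
Index usage observations like these come from sys.dm_db_index_usage_stats. A query along these lines (a minimal sketch) shows seeks, scans, lookups, and updates per index in the current database:

--Seeks, scans, lookups, and updates per index in the current database.
SELECT OBJECT_NAME(s.object_id) AS TableName,
 i.name AS IndexName,
 s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats s
JOIN sys.indexes i
 ON i.object_id = s.object_id
 AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY TableName, IndexName;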

I am still somewhat new to columnstore indexes, and my sample size is admittedly small. But so far, real-time operational analytics appears to be the best of both worlds. Reporting and analytical queries that scan wide swaths of data are working well with columnstore indexes, while existing rowstore indexes provide traditionally efficient seeks. Are you using nonclustered columnstore indexes on OLTP tables too? Please leave a comment! I haven't found anyone else that's doing this yet. Happy indexing, everyone.


Scraping Web Pages - Part 2

Dave Mason - RCurl XML

In the last post, we looked at a way to scrape HTML table data from web pages, and save the data to a table in SQL Server. One of the drawbacks is the need to know the schema of the data that gets scraped--you need a SQL Server table to store the data, after all. Another shortcoming is if there are multiple HTML tables, you need to identify which one(s) you want to save.

For this post, we'll revisit web scraping with Machine Learning Services and R. This time, we'll take a schema-less approach that returns JSON data. As before, this web page will be scraped: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players, the second is a listing of games played during the regular season.


R Code

This code uses the RCurl and XML packages (as was the case in the last post), plus the jsonlite package. There's a bit more code this time. Hopefully the code comments are helpful:

library(RCurl)
library(XML)
library(jsonlite)

#type.convert - sets appropriate class for each data frame variable/column.
TypeConvertDataFrameList <- function(lsDataFrame) {
    lapply(lsDataFrame, type.convert, as.is = TRUE)
}

url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html"
html = RCurl::getURL(url)

dfHtmlTableData <- XML::readHTMLTable(html, header = TRUE,
                                 as.data.frame = TRUE,
                                 stringsAsFactors = FALSE)
#remove null elements from list.
dfHtmlTableData <- dfHtmlTableData[!sapply(dfHtmlTableData, is.null)]

#convert data frame variables to appropriate classes.
dfHtmlTableData <- lapply(dfHtmlTableData, TypeConvertDataFrameList)

dfTables <- data.frame(url = rep(c(url), each = length(dfHtmlTableData)))

#Empty vector for JSON string data.
jsonStringsFactor <- rep(c(""), each = length(dfHtmlTableData))

if (length(dfHtmlTableData) > 0) {
    #Convert each HTML table to JSON.
    for (i in 1:length(dfHtmlTableData)) {
        jsonStringsFactor[i] <- jsonlite::toJSON(
      as.data.frame(dfHtmlTableData[[i]]), auto_unbox = TRUE)
    }
}

dfTables$HtmlTableAsJson <- jsonStringsFactor
dfTables

Lines 6-8 define a separate function that is invoked farther down in the script. Line 10 specifies the URL string for the web page we want to scrape. Line 11 gathers the entirety of the HTML from the page. And line 13 creates a list of data frames. At line 17, NULL elements (if any) are removed from the list. At line 20, lapply is invoked. It iterates over the list of HTML table data and in turn invokes the TypeConvertDataFrameList function defined at lines 6-8. This function iterates over the variables of a data frame (much like columns of a database table) and attempts to cast them from a character (string) class to a different class, if appropriate. Without this, all of our scraped data would be returned as strings.

At line 22 we begin creating the single data frame that will be returned to SQL Server. A "url" variable (column) is created with one observation (row) for each HTML table found. Each value is the URL of the web page that was scraped. At line 25, the jsonStringsFactor vector is created with one element for each HTML table found; it holds empty strings initially. In lines 29-31, the code iterates over each data frame element in dfHtmlTableData, converts the data to JSON, and assigns it to the corresponding element of jsonStringsFactor. Line 35 creates an "HtmlTableAsJson" variable (column) for our data frame and assigns jsonStringsFactor to it. We now have a data frame with two variables (columns) and three observations (rows). Here is the output of the code (truncated for display purposes):

url                                     HtmlTableAsJson                                          
---------------------------------       ----------------------------------------------------------------------------------------------------------------------------------------------------------------
https://...celtics-2016-2017.html       [{"Jersey.Number":0,"Player":"Avery Bradley","Position":"SG","Height":"6-3","Weight":180,"Birthdate":"11/26/1990","Years.Experience":"6","College":"Texas"},...]
https://...celtics-2016-2017.html       [{"Game":1,"Date":"Wed, Oct 26, 2016","Time":"7:30p","At":"","Opponent":"Brooklyn Nets","Result":"W","OT":"","Points.For":122,"Points.Against":117,"Wins.Sea...]
https://...celtics-2016-2017.html       [{}]


SQL Server

With the necessary R code in hand, we can run it in-database with Machine Learning Services. I've wrapped the code within a stored procedure:

CREATE OR ALTER PROCEDURE dbo.ScrapeHtmlTables
 @Url NVARCHAR(MAX)
AS
BEGIN

DECLARE @RScript NVARCHAR(MAX);
SET @RScript = N'library(RCurl)
library(XML)
library(jsonlite)

#type.convert - sets appropriate class for each data frame variable/column.
TypeConvertDataFrameList <- function(lsDataFrame){
  lapply(lsDataFrame, type.convert, as.is = TRUE)
}

url <- "' + @Url + '"
html = RCurl::getURL(url)

dfHtmlTableData <- XML::readHTMLTable(html, header = TRUE,
                                 as.data.frame = TRUE,
                                 stringsAsFactors = FALSE)
#remove null elements from list.
dfHtmlTableData <- dfHtmlTableData[!sapply(dfHtmlTableData, is.null)]

#convert data frame variables to appropriate classes.
dfHtmlTableData <- lapply(dfHtmlTableData, TypeConvertDataFrameList)

dfTables <- data.frame(url = rep(c(url), each = length(dfHtmlTableData)))

#Empty vector for JSON string data.
jsonStringsFactor <- rep(c(""), each = length(dfHtmlTableData))

if (length(dfHtmlTableData) > 0) {
    #Convert each HTML table to JSON.
    for (i in 1:length(dfHtmlTableData)) {
        jsonStringsFactor[i] <- jsonlite::toJSON(
      as.data.frame(dfHtmlTableData[[i]]), auto_unbox = TRUE)
    }
}

dfTables$HtmlTableAsJson <- jsonStringsFactor
OutputDataSet <- dfTables

OutputDataSet';

EXEC sp_execute_external_script 
 @language = N'R',
 @script = @RScript
WITH RESULT SETS (
 (
  [Url] VARCHAR(2048),
  HtmlTableAsJson VARCHAR(MAX)
 )
);
END
GO

Now we can retrieve a web page's HTML table data by invoking a stored procedure and passing in a URL string. We don't need to know how many HTML tables there are, and we don't need to know their schemas:

CREATE TABLE #HtmlTableData (
 ID INT IDENTITY,
 [Url] VARCHAR(2048),
 HtmlTableAsJson VARCHAR(MAX)
)

INSERT INTO #HtmlTableData
EXEC dbo.ScrapeHtmlTables @Url = N'https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html';

SELECT *
FROM #HtmlTableData;

Dave Mason - RStats - sp_execute_external_script


JSON

From here on out, parsing the returned data is a JSON exercise. We would, of course, need to know the JSON schema to query it in a tabular or meaningful way. Or would we?
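
If we did know the schema (say, for the roster grid), OPENJSON with an explicit WITH clause would shred the JSON back into rows and columns. Here's a sketch based on the JSON shown above, mapping just a subset of the properties:

--Shred the roster JSON (ID = 1) with a known, explicit schema.
DECLARE @Json NVARCHAR(MAX);
SELECT @Json = d.HtmlTableAsJson
FROM #HtmlTableData d
WHERE d.ID = 1;

SELECT r.*
FROM OPENJSON(@Json)
WITH (
 Player VARCHAR(128) '$.Player',
 Position VARCHAR(4) '$.Position',
 Height VARCHAR(8) '$.Height',
 Weight SMALLINT '$.Weight',
 College VARCHAR(128) '$.College'
) r;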

In another post, I described an automated way to infer JSON schema using dynamic TSQL. Perhaps that article will pique your interest. Here's a look at what I had in mind.

DECLARE @Json VARCHAR(MAX);
SELECT @Json = d.HtmlTableAsJson
FROM #HtmlTableData d
WHERE d.ID = 2;

EXEC dbo.OpenJsonTabular @Json;
Dave Mason - JSON

Scraping Web Pages

Dave Mason RStats

I don't consider myself a data scientist. But that hasn't stopped me from finding some use cases for Machine Learning Services for SQL Server. One of them is acquiring data by scraping web pages. If you're a regular here, you probably know R is my language of preference. However, Python has packages for scraping web pages, and I would imagine Java does too.

For this post, it might make more sense to skip ahead to the end result, and then work our way backwards. Here is a web page with some data: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players. We will scrape the web page, and write the data from the "roster" grid to a SQL Server table. It will look much like this:

Dave Mason sp_execute_external_script


R Code

R has an abundance of packages. This frequently provides numerous options to accomplish any task. For this one, I'll be using the RCurl and XML packages. This bit of code will do most of the work in R:

library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html"
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE,
                             as.data.frame = TRUE,
                             stringsAsFactors = FALSE);

Lines 1 and 2 load the necessary R packages. Line 3 specifies the URL string for the web page we want to scrape. Line 4 gathers the entirety of HTML from the page. And Line 5 creates a list of data frames. After we run the code, we can check the length of tables to see how many HTML tables were found. In this case, there are four:

> length(tables)
[1] 4
> 

Continuing on, I'll use the head() function to preview the first three rows of each data frame. This shows me the HTML table of "roster" data lies in the first data frame tables[[1]]:

> head(tables[[1]], n = 3L)
+ head(tables[[2]], n = 3L)
+ head(tables[[3]], n = 3L)
+ head(tables[[4]], n = 3L)

  Jersey Number        Player Position Height Weight  Birthdate Years Experience    College
1             0 Avery Bradley       SG    6-3    180 11/26/1990                6      Texas
2             7  Jaylen Brown       SF    6-6    223 10/24/1996                R California
3            99   Jae Crowder       SF    6-6    235 07/06/1990                4  Marquette

  Game              Date  Time At          Opponent Result OT Points For Points Against Wins-Season Losses-Season Streak
1    1 Wed, Oct 26, 2016 7:30p        Brooklyn Nets      W           122            117           1             0    W 1
2    2 Thu, Oct 27, 2016 8:00p  @     Chicago Bulls      L            99            105           1             1    L 1
3    3 Sat, Oct 29, 2016 7:00p  @ Charlotte Hornets      W           104             98           2             1    W 1

  CONNECT
1        

NULL
> 

With that, the R code that we will use in SQL Server via sp_execute_external_script will be as follows:

library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html"
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE,
                             as.data.frame = TRUE,
                             stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]


SQL Server

With the necessary R code in hand, let's run it in-database with Machine Learning Services. Since we know the schema, it is specified via WITH RESULT SETS:

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html";
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]'
WITH RESULT SETS (
 (
  JerseyNumber VARCHAR(4),
  PlayerName VARCHAR(128),
  Position VARCHAR(4),
  Height VARCHAR(8),
  Weight SMALLINT,
  BirthDate DATE,
  YearsExperience VARCHAR(2),
  College VARCHAR(128)
 )
);

Dave Mason RStats - RCurl XML

If desired, we can save the output to a temp table:

USE tempdb;
GO

DROP TABLE IF EXISTS #Roster;

CREATE TABLE #Roster (
 JerseyNumber VARCHAR(4),
 PlayerName VARCHAR(128),
 Position VARCHAR(4),
 Height VARCHAR(8),
 Weight SMALLINT,
 BirthDate DATE,
 YearsExperience VARCHAR(2),
 College VARCHAR(128)
)

INSERT INTO #Roster
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html";
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]'

We can go a step further and write the data to a permanent table (it's tempdb in this example, but you get the idea). Here I'll prepend two columns to indicate the season and team:

USE tempdb;
GO

SELECT CAST('2016-2017' AS VARCHAR(9)) AS Season, 
 CAST('BOS' AS VARCHAR(3)) AS Team, * 
INTO dbo.Roster
FROM #Roster;

Now we have the data as depicted in the first image at the top of this post:

SELECT * FROM dbo.Roster;
Dave Mason RStats - getURL readHTMLTable

As it turns out, there are more web pages of Boston Celtics roster data I can scrape (note the underlying URLs referenced in the code below).

With a couple of minor edits to the TSQL code, I can grab that data too:
--Data for 2017-2018 season.
INSERT INTO #Roster
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2017-2018.html";
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]';

--Data for 2018-2019 season.
INSERT INTO #Roster
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2018-2019.html";
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]';


Limitations of Schema

One of the drawbacks to the examples and techniques presented here is that we have to know the schema of the returned data before we can write it to a SQL Server table. If a web page has multiple HTML tables, we also have to figure out which ones have the data we intend to save. In my next post, I'll look at another option for web scraping that is not dependent on schema. In the meantime, there are a few caveats and gotchas to be aware of if you want to try web scraping in-database:

  1. If needed, the two packages can be installed as follows:
    • install.packages("RCurl")
    • install.packages("XML")
  2. By default, the SQL Server Setup disables outbound connections by creating firewall rules. Check out Firewall configuration for SQL Server Machine Learning Services.
  3. The getURL() function in the RCurl package doesn't work well (or at all) with JavaScript and dynamically generated HTML content. (Try scraping this page, for example.) I'm told there are other R packages that might work for those pages.
  4. Please take a long, hard look at The Web Scraping Best Practices Guide before you begin!


Data Obfuscation for SQL Server

Dave Mason - Data Obfuscation for SQL Server

In a previous post, I explored an option for generating fake data in SQL Server using Machine Learning Services and the R language. I've expanded on that by creating some stored procedures that can be used both for generating data sets of fake data and for obfuscating existing SQL Server data with fake data.

The code is available in a Github repository. For now, it consists of ten stored procedures. For data obfuscation, you'd use one or more of these five stored procs:

  • Obfuscator.SetAlphaNumericData
  • Obfuscator.SetEmailAddresses
  • Obfuscator.SetFullNames
  • Obfuscator.SetPhoneNumbers
  • Obfuscator.SetSSNs

Each procedure has input parameters for database name, table schema, table name, and column name(s). There is also an optional parameter to indicate whether triggers should be disabled and re-enabled. (Obfuscator.SetSSNs has an additional optional parameter to specify a formatting character.) Here are some examples from the [AdventureWorks] database, showing the data both before and after running each obfuscation proc:


Alphanumeric Data

EXEC MLtools.Obfuscator.SetAlphaNumericData
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'Production',
 @TableName = 'Product',
 @AlphaNumericColumn = 'ProductNumber',
 @DisableTriggers = 1;
Obfuscator.SetAlphaNumericData - Data Obfuscation for SQL Server

Before / After
Obfuscator.SetAlphaNumericData - Data Obfuscation for SQL Server Obfuscator.SetAlphaNumericData - Data Obfuscation for SQL Server


Email Addresses

EXEC MLtools.Obfuscator.SetEmailAddresses
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'Person',
 @TableName = 'EmailAddress',
 @EmailAddressColumn = 'EmailAddress',
 @DisableTriggers = 1;
Obfuscator.SetEmailAddresses - Data Obfuscation for SQL Server

Before / After
Obfuscator.SetEmailAddresses - Data Obfuscation for SQL Server Obfuscator.SetEmailAddresses - Data Obfuscation for SQL Server


Full Names

EXEC MLtools.Obfuscator.SetFullNames
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'Person',
 @TableName = 'Person',
 @FirstNameColumn = 'FirstName',
 @LastNameColumn = 'LastName',
 @MiddleNameColumn = 'MiddleName',
 @DisableTriggers = 1;
Obfuscator.SetFullNames - Data Obfuscation for SQL Server

Before / After
Obfuscator.SetFullNames - Data Obfuscation for SQL Server Obfuscator.SetFullNames - Data Obfuscation for SQL Server


Phone Numbers

EXEC MLtools.Obfuscator.SetPhoneNumbers
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'Person',
 @TableName = 'PersonPhone',
 @PhoneNumColumn = 'PhoneNumber',
 @DisableTriggers = 1;
Obfuscator.SetPhoneNumbers - Data Obfuscation for SQL Server

Before / After
Obfuscator.SetPhoneNumbers - Data Obfuscation for SQL Server Obfuscator.SetPhoneNumbers - Data Obfuscation for SQL Server


Social Security Numbers

EXEC MLtools.Obfuscator.SetSSNs
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'HumanResources',
 @TableName = 'Employee',
 @SsnColumn = 'NationalIDNumber',
 @FormatChar = '-',
 @DisableTriggers = 1;
Obfuscator.SetSSNs - Data Obfuscation for SQL Server

Before / After
Obfuscator.SetSSNs - Data Obfuscation for SQL Server Obfuscator.SetSSNs - Data Obfuscation for SQL Server


Design Notes

The "Obfuscator" stored procedures each invoke a "Generator" procedure. These returns result sets of fake data without performing any changes to existing data:

  • Generator.GetAlphaNumericData
  • Generator.GetEmailAddresses
  • Generator.GetFullNames
  • Generator.GetPhoneNumbers
  • Generator.GetSSNs

The "Generator" and "Obfuscator" functionality were separated for a couple of reasons. The first, perhaps more obvious reason, is that each type of procedure has a different purpose. The other reason has to do with script language. My code use the R language and takes advantage of the "generator" package for R (among others). But perhaps R isn't your favorite. The procedures could easily be rewritten using Python (starting with SQL 2017) or Java (starting with SQL 2019). Those languages may offer other packages/libraries with better or different functionality for whatever your needs are. Or you may simply have more experience with (or a preference for) a different scripting language.

All of the stored procedures have a header and code comments--hopefully it is easy to understand and is not too arcane. Happy obfuscating!


sp_execute_external_script and Bytecode Version Mismatch

You may run into some trouble with sp_execute_external_script and R when you pass in CHAR and/or VARCHAR "string" data as part of your input data set. Here is an example:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT  ''Dave'' AS FirstName, ''Mason'' AS LastName
 UNION ALL
 SELECT ''Björk'', ''Guðmundsdóttir'';';

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'Returned_Names <- Input_Names',
 @input_data_1 = @QueryIn,
 @input_data_1_name = N'Input_Names',
 @output_data_1_name = N'Returned_Names';


A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 2, Line 2
An external script error occurred: 
During startup - Warning message:
In setJsonDatabasePath(system.file("extdata/capabilities.json",  :
  bytecode version mismatch; using eval
Error in sort.list(y) : invalid input 'Björk' in 'utf8towcs'
Calls: data.frame ... as.data.frame -> as.data.frame.character -> factor -> sort.list

Error in execution.  Check the output for more information.
Error in eval(expr, envir, enclos) : 
  Error in execution.  Check the output for more information.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted

A quick turnaround for this issue is to CAST the "string" data as NVARCHAR:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT  CAST(''Dave'' AS NVARCHAR(32)) AS FirstName, CAST(''Mason'' AS NVARCHAR(32)) AS LastName
 UNION ALL
 SELECT CAST(''Björk'' AS NVARCHAR(32)), CAST(''Guðmundsdóttir'' AS NVARCHAR(32));';

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'Returned_Names <- Input_Names',
 @input_data_1 = @QueryIn,
 @input_data_1_name = N'Input_Names',
 @output_data_1_name = N'Returned_Names';

Dave Mason - sp_execute_external_script


Personal Experience

During my testing, I've found R handles CHAR and VARCHAR data within the input data set as long as the ASCII codes comprising the data are in the range from 0 to 127. This much is not surprising--those are the character codes of the ASCII table. Starting with character code 128, R begins having some trouble. To demonstrate, I'll use this query against the undocumented system table spt_values:

SELECT v.number AS [ASCII Code], 
 CHAR(v.number) AS [Char Value]
FROM master..spt_values v
WHERE v.type = 'p'
AND v.number <= 191

Here is the query above passed as the input data set to sp_execute_external_script:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT v.number, CHAR(v.number)
 FROM master..spt_values v
 WHERE v.type = ''p''
 AND v.number <= 191;';

 EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'OutputDataSet <- InputDataSet',
 @input_data_1 = @QueryIn
WITH RESULT SETS (
 (
  [ASCII Code] INT,
  [Character Value] NCHAR(1)
 )
)

In the output result set, we see character values are interpreted correctly up until ASCII code 128. From there onward, the characters are not interpreted correctly.

Dave Mason - sp_execute_external_script - ASCII Codes

Those of you that were watching closely noticed I only selected ASCII codes up to 191. Why did I stop there? If the input data set includes any ASCII codes of 192 or above, we get the "bytecode version mismatch" error:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT v.number, CHAR(v.number)
 FROM master..spt_values v
 WHERE v.type = ''p''
 AND v.number <= 192;';

 EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'OutputDataSet <- InputDataSet',
 @input_data_1 = @QueryIn
WITH RESULT SETS (
 (
  [ASCII Code] INT,
  [Character Value] NCHAR(1)
 )
)


Msg 39004, Level 16, State 20, Line 15
A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 2, Line 15
An external script error occurred: 
During startup - Warning message:
In setJsonDatabasePath(system.file("extdata/capabilities.json",  :
  bytecode version mismatch; using eval
Error in sort.list(y) : invalid input 'À' in 'utf8towcs'
Calls: data.frame ... as.data.frame -> as.data.frame.character -> factor -> sort.list

Error in execution.  Check the output for more information.
Error in eval(expr, envir, enclos) : 
  Error in execution.  Check the output for more information.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted


But wait! There's more...

All of the testing examples I've shown thus far have been with R version 3.3.3. This is what got installed when I installed Machine Learning Services with my SQL Server 2017 instance.

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'dfReturn <- as.data.frame(as.character(version["version.string"]))',
 @output_data_1_name = N'dfReturn';


-----------------------------------
R version 3.3.3 (2017-03-06)

(1 row affected)

On another instance of SQL Server 2017, I was able to upgrade the R version to 3.5.2 (this was more troublesome than I expected--there were many hoops to jump through).

EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'dfReturn <- as.data.frame(as.character(version["version.string"]))',
 @output_data_1_name = N'dfReturn';


-----------------------------------
R version 3.5.2 (2018-12-20)

(1 row affected)

With the newer version of R, I've not been able to reproduce the "bytecode version mismatch" error. ASCII codes from 128 - 255 return a character, while ASCII codes 256 and above return NULL:

DECLARE @QueryIn NVARCHAR(MAX) = N'
 SELECT v.number, CHAR(v.number)
 FROM master..spt_values v
 WHERE v.type = ''p''
 AND v.number <= 2048;';

 EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'OutputDataSet <- InputDataSet',
 @input_data_1 = @QueryIn
WITH RESULT SETS (
 (
  [ASCII Code] INT,
  [Character Value] NCHAR(1)
 )
)

Dave Mason - sp_execute_external_script - Extended ASCII Codes

Mapping types between one system and another can be tricky. Tough choices sometimes have to be made. In closing, I'll simply recommend you cast SQL types CHAR and VARCHAR to NCHAR and NVARCHAR respectively for input data sets to sp_execute_external_script. Happy coding, everyone!


An Alternative to BCP

Dave Mason - rstats

BCP and OPENROWSET are long-lived SQL Server options for working with data in external files. I've blogged about OPENROWSET, including a recent article showing a way to deal with quoted data. One shortcoming I've never been able to overcome is an inconsistent data file with data fields in some rows enclosed in double quotes, but not in others.

Let's demonstrate with BCP. Below is a sample data file I'll attempt to load into a SQL Server table. Note the data fields enclosed in double quotes, which contain the field terminator , (comma) character. For reference, the file is also available on Github.

name,phone,DOB,email,SSN,job,company
Stanley Lind,5819626239,2013-07-16,denhj@q.ila,241-63-2894,Chartered loss adjuster,"Mante, Mante and Mante"
Hale Bins,3212496148,1903-02-25,piuylnbstd@gxqshjlek.zfk,206-93-3199,Sports development officer,"Borer, Borer and Borer"
Mikaela O'Keefe,5245629478,1927-09-01,hcpykwmbli@o.qfn,159-33-2513,Museum/gallery conservator,"Jacobs, Jacobs and Jacobs"
Renaldo Kub,3175312457,1918-12-10,ualhdze@xktdvileb.pla,696-91-1436,Comptroller,Jacobs-Jacobs
Keagan Hackett,1374574537,1992-12-01,wicybnlpkr@pgnmy.myq,803-23-7144,Private music teacher,Quitzon LLC
Kiana Cronin m.d.,3297585362,2017-01-24,ctrwj@cpmwtrus.jnx,133-84-8085,Fisheries officer,Witting Ltd
Harlow Durgan,9239627469,1918-04-13,qzfgcrkhno@qijz.que,858-94-5148,"Nurse, mental health",Wilkinson PLC
Sheila Donnelly-Blanda,4657648749,2005-07-19,spmzor@kl.uej,737-58-1638,"Therapist, occupational",Barrows PLC
Mr Colin Turcotte,5634176974,1952-01-18,ksagn@csyarmhd.aev,946-84-5702,Maintenance engineer,Kub-Kub
Lesta Krajcik,6244527243,1921-04-20,b@wmjisa.jec,432-99-2126,Nutritional therapist,White-White

Here is the bcp command:

TITLE "%~n0"
ECHO OFF
COLOR 0E
CLS

bcp tempdb.guest.ImaginaryPeople in "C:\Output\ImaginaryPeople.csv" -S .\SQL2017 -T -c -t, -r "\n" -F 2

pause

And here are the results when we query the newly loaded table. BCP still includes the double quotes, as we see on the first three rows, although the , (comma) character is intact and in the right place within the column/field data. However, lines 7 and 8 are wrong: the "job" data is split across two columns/fields, and a field terminator , (comma) character now appears within the "company" data.

Dave Mason - BCP


An Alternative

Now let's look at an alternative to BCP. Using SQL Server Machine Learning Services, we can harness all of the flexibility the R language has to offer (or Python or Java, depending on which version of SQL Server you are using). The code below truncates the table and executes sp_execute_external_script, invoking the R function read.csv().

TRUNCATE TABLE tempdb.guest.ImaginaryPeople;
INSERT INTO tempdb.guest.ImaginaryPeople
EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'  
csvFile <- read.csv(file = "C:\\Output\\ImaginaryPeople.csv", 
                    sep = ",", 
                    header = TRUE,
                    numerals = "no.loss")
csvFile$phone <- as.character(csvFile$phone)',
 @output_data_1_name = N'csvFile';

Here are the query results for the newly loaded table. The double quotes that previously enclosed data columns/fields are gone, all the , (comma) characters are where they should be, and all the field data are in the correct columns.

Dave Mason - sp_execute_external_script - read.csv


Dave's Thoughts

BCP and OPENROWSET have stood the test of time, and rightly so. They're great options. And while you might argue they're imperfect, I will be quick to remind you they are extremely fast. In situations where those tools may not work or are not preferable, there are certainly other options. (SSIS, PowerShell, and C# come to mind.) Maybe R, Python, or Java in-database are other options you'd consider? Comments welcome below. Happy coding!


Pivoting Data Without TSQL

Dave Mason - SQL Server - sp_execute_external_script

Pivoting data in SQL Server is something that comes up infrequently for me. Whenever the need arises, I have to pause and ask myself "What is it I'm trying to do again?". Next I go to the documentation for the T-SQL PIVOT syntax (which I'll never commit to memory) and realize for the umpteenth time the pivoted values have to be hard coded. Then I ponder using dynamic T-SQL because I won't always know the values to pivot at query design time.
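
For context, here's roughly what a hard-coded T-SQL PIVOT of the purchase order data (shown a bit further down) looks like. The shipping method names in the IN list are placeholders here, but the point stands either way: they have to be spelled out at design time.

--Placeholder shipping method names; a real query needs the actual values hard-coded (or dynamic T-SQL).
SELECT pvt.EmployeeName, pvt.[Ship Method A], pvt.[Ship Method B], pvt.[Ship Method C]
FROM (
 SELECT 
  p.FirstName + ' ' + p.LastName AS EmployeeName, 
  h.SubTotal, 
  sm.Name AS ShippingMethod
 FROM AdventureWorks.Purchasing.PurchaseOrderHeader h
 JOIN AdventureWorks.Person.Person p
  ON p.BusinessEntityID = h.EmployeeID
 JOIN AdventureWorks.Purchasing.ShipMethod sm
  ON sm.ShipMethodID = h.ShipMethodID
) src
PIVOT (
 SUM(SubTotal)
 FOR ShippingMethod IN ([Ship Method A], [Ship Method B], [Ship Method C])
) pvt;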

If T-SQL isn't a good hammer to PIVOT's nail, perhaps R is. There are different packages that make summarizing and transposing data frames fairly easy, and R can even pivot values that aren't known until runtime. But there is a catch, which I'll get to in a bit.

Here's an [AdventureWorks] query showing purchase orders, including the employee that created each PO, the subtotal, and the shipping method.

SELECT 
 p.FirstName + ' ' + p.LastName AS EmployeeName, 
 h.SubTotal, 
 sm.Name AS ShippingMethod
FROM AdventureWorks.Purchasing.PurchaseOrderHeader h
JOIN AdventureWorks.HumanResources.Employee e
 ON e.BusinessEntityID = h.EmployeeID
JOIN AdventureWorks.Person.Person p
 ON p.BusinessEntityID = h.EmployeeID
JOIN AdventureWorks.Purchasing.ShipMethod sm
 ON sm.ShipMethodID = h.ShipMethodID

SQL Server AdventureWorks

While there are likely countless methods and packages that can be used to pivot a data frame in R, I'll focus on the dcast function from the reshape2 package. Here's a small block of code to demonstrate, along with the output:


library(RODBC)
library(reshape2)

dbhandle <- odbcDriverConnect('driver={SQL Server};server=.\\SQL2017;database=AdventureWorks;trusted_connection=true')
res <- sqlQuery(dbhandle, "SELECT 
  p.FirstName + ' ' + p.LastName AS EmployeeName, 
  h.SubTotal, 
  sm.Name AS ShippingMethod
FROM Purchasing.PurchaseOrderHeader h
JOIN HumanResources.Employee e
  ON e.BusinessEntityID = h.EmployeeID
JOIN Person.Person p
  ON p.BusinessEntityID = h.EmployeeID
JOIN Purchasing.ShipMethod sm
  ON sm.ShipMethodID = h.ShipMethodID")
po_sum <- dcast(res, EmployeeName ~ ShippingMethod, value.var = "SubTotal", fun.aggregate = sum)
po_sum

Dave Mason - Microsoft R Client

Most of the R code is a database connection and query. The result set is assigned to the res data frame. The pivoting is done with a single line of code at line 16 that invokes the dcast function. It specifies EmployeeNames as row values, ShippingMethods pivoted as columns, and SubTotals summed. No matter how many different ShippingMethods there are, the dcast function dynamically creates columns for them as needed.

Now, let's see what this looks like using R in-database with sp_execute_external_script:

 DECLARE @QueryIn NVARCHAR(MAX) = N'SELECT 
 p.FirstName + '' '' + p.LastName AS EmployeeName, 
 h.SubTotal, 
 sm.Name AS ShippingMethod
FROM AdventureWorks.Purchasing.PurchaseOrderHeader h
JOIN AdventureWorks.HumanResources.Employee e
 ON e.BusinessEntityID = h.EmployeeID
JOIN AdventureWorks.Person.Person p
 ON p.BusinessEntityID = h.EmployeeID
JOIN AdventureWorks.Purchasing.ShipMethod sm
 ON sm.ShipMethodID = h.ShipMethodID';

 EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'  
library(reshape2)
po_sum <- dcast(purchase_orders, 
  purchase_orders$EmployeeName ~ purchase_orders$ShippingMethod, 
  value.var = "SubTotal", fun.aggregate = sum)',
 @input_data_1 = @QueryIn,
 @input_data_1_name = N'purchase_orders',
 @output_data_1_name = N'po_sum';

Dave Mason - SSMS - sp_execute_external_script

We have our same pivoted result set of 12 rows and 6 columns. But something's missing: the column names. And that's the catch. If you don't use WITH RESULT SETS when invoking sp_execute_external_script, you don't get column names. That's especially important here because the pivoted data values would normally be the column names.

This is one aspect of sp_execute_external_script that vexes me. SQL Server won't infer the column names on its own, even though that metadata is readily available. R has a lot of potential for developers and DBAs, even though we may not be data scientists per se. I'm hoping Microsoft will address this shortcoming in a future Cumulative Update for all versions from SQL Server 2016 forward.

And maybe you can help! If you would, please upvote this customer feedback item. Thanks for reading...and thanks for voting!


Generating Fake Data in SQL Server With R

SQL Server - sp_execute_external_script - R

I've been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals that aren't data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again and I wondered, can R be used to obfuscate PII data?

In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I'll focus on the "generator" package. It includes the following functions (whose names should be fairly self-explanatory) for generating fake data:

  • r_credit_card_numbers
  • r_date_of_births
  • r_email_addresses
  • r_full_names
  • r_ipv4_addresses
  • r_latitudes
  • r_longitudes
  • r_national_identification_numbers
  • r_phone_numbers

All of the functions have a required n parameter for the number of observations to generate. For dates of birth and phone numbers, there are some additional optional parameters for customization. Here's an example (using the functions I find most useful) via SQL Server's Machine Learning Service and sp_execute_external_script:

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(generator)
num_observations <- 100
df <- as.data.frame(r_full_names(num_observations))
df$phone <- r_phone_numbers(num_observations, use_hyphens = TRUE)
df$DOB <- as.character(r_date_of_births(num_observations))
df$email <- r_email_addresses(num_observations)
df$SSN <- r_national_identification_numbers(num_observations)
OutputDataSet <- df'
WITH RESULT SETS ((
 FullName VARCHAR(255),
 PhoneNumber VARCHAR(16),
 DOB DATE,
 EmailAddress VARCHAR(255),
 SSN CHAR(11)
));

R generator package


Dave's Thoughts

This presents a really interesting option, although there are many questions I haven't broached that are outside the scope of this post. Generating random dates and digits is fairly trivial with T-SQL. Generating random (and meaningful) strings is not. Those fake names are pretty cool, although if you are outside of the US, the diversity of those names (or lack thereof) may be a concern. The email addresses are not realistic in my opinion. They look too random, as you can see by looking at a few of the top-level domain name suffixes. (.lhw? .nxa?)

There probably aren't enough functions in the generator package to roll your own complete PII obfuscation solution. However, the package appears to be at version 1.0. The documentation notes functions for home address and birthplace (among others) are "To be added". If those functions arrive any time soon, I'll be considering using "generator" for some non-production environments.

Are there any other R packages you know of for generating fake PII? Leave a message in the comments!