2017-12-13

Machine Learning Services and Result Sets part 3

Dave Mason SQL Server Machine Learning Services sp_execute_external_script WITH RESULT SETS

To quickly recap the last couple of posts, the set of data returned by sp_execute_external_script has no column names by default. The column names (and data types) can be defined by using the WITH RESULT SETS options of the EXECUTE statement. But you have to know the result sets definition (number of columns, column names, and data types) at design time (before running the query). And that's a shame: there are column names for R and Python data frames, but SQL Server doesn't have a built-in way to access that metadata dynamically at runtime.

Perhaps in a future version of SQL Server, there will be a new parameter for WITH RESULTS SETS or a way to specify the result sets definition without having to hard code it. Until then, we have to do without, or get creative. Let's take a look at one possible solution.

Here's some R code that scrapes a web page and assigns data for the first HTML table into data frame df.

library(rvest)
library(dplyr)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
df <- html_table(hDoc, header = TRUE, fill = TRUE)[[1]]
head(df)
                     TEAM  W  L  PCT   GB   HOME   ROAD   CONF   PF   PA DIFF L-10 STRK
1         Houston Rockets 21  4 0.84    -  9 - 3 12 - 1 12 - 2 2876 2601  275 10-0   10
2          Boston Celtics 23  6 0.79    - 12 - 2 11 - 4 15 - 5 2994 2822  172  7-3    1
3   Golden State Warriors 22  6 0.79  0.5  9 - 3 13 - 3 10 - 4 3276 2968  308  9-1    7
4     Cleveland Cavaliers 20  8 0.71  2.5 11 - 4  9 - 4 16 - 6 3107 3008   99  9-1    2
5       San Antonio Spurs 19  9 0.68  3.5 13 - 2  6 - 7  9 - 5 2830 2731   99  8-2    1
6         Toronto Raptors 17  8 0.68  4.0  9 - 1  8 - 7  8 - 4 2763 2580  183  7-3    1

The "df" data frame columns have names and types:

library(rvest)
library(dplyr)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
df <- html_table(hDoc, header = TRUE, fill = TRUE)[[1]]

tableMetadata <- data.frame(colName = names(df),
                          colType = sapply(df, class),
                          row.names = NULL)
tableMetadata

   colName   colType
1     TEAM character
2        W   integer
3        L   integer
4      PCT   numeric
5       GB character
6     HOME character
7     ROAD character
8     CONF character
9       PF   integer
10      PA   integer
11    DIFF   integer
12    L-10 character
13    STRK   integer

We can take those names and R types, string them together, and "convert" them to SQL data types. (Mapping data types from one language to another is waaaay outside the scope of this post. Lines 11-13 are quick and dirty, just for demonstration purposes. Okie dokie?)

library(rvest)
library(dplyr)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
df <- html_table(hDoc, header = TRUE, fill = TRUE)[[1]]

tableMetadata <- data.frame(colName = names(df),
                          colType = sapply(df, class),
                          row.names = NULL)

tableMetadataMutated <- tableMetadata %>% mutate(SqlDataType =
    ifelse(colType == "character", "NVARCHAR(1024)",
    ifelse(colType == "integer", "BIGINT",
    ifelse(colType == "numeric", "NUMERIC(35, 10)",
    NA))))

rResultSet <- paste("[", tableMetadataMutated$colName, "]", " ",
    tableMetadataMutated$SqlDataType,
    ", ",
    sep = "",
    collapse = "")
rResultSet <- substr(rResultSet, 1, nchar(rResultSet) - 2)
rResultSet

[1] "[TEAM] NVARCHAR(1024), [W] BIGINT, [L] BIGINT, [PCT] NUMERIC(35, 10), [GB] NVARCHAR(1024), [HOME] NVARCHAR(1024), [ROAD] NVARCHAR(1024), [CONF] NVARCHAR(1024), [PF] BIGINT, [PA] BIGINT, [DIFF] BIGINT, [L-10] NVARCHAR(1024), [STRK] BIGINT"

When you look at the output, do you see where we're headed? Let's switch gears to T-SQL. Below is a script. Here are a few notes:

  • The R script code is assigned to the T-SQL @Rscript variable (lines 1-21).
  • At line 24, sp_execute_external_script is executed.
  • rResultSet in the R script will contain the string of column names and SQL data types (lines 15-20).
  • The value of rResultSet in R is passed to T-SQL variable @WithResultSets (line 29).
  • We don't want a dataset returned, so on line 27 we specify the data frame name "none", which doesn't exist in the R script.
  • At line 32 we declare a @Tsql variable to store our dynamically created T-SQL string.
  • The string builds an EXEC sp_execute_external_script statement (line 33).
  • The R script is reused (line 35).
  • WITH RESULT SETS is specified by way of the @WithResultSets T-SQL variable.
  • Lastly, the T-SQL string is executed (line 39).
  • This time, the @output_data_1_name is *not* specified. The R dataframe named OutputDataSet (line 5) is returned to SQL Server by default.

DECLARE @Rscript NVARCHAR(MAX) = '
library(rvest)
library(dplyr)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
OutputDataSet <- html_table(hDoc, header = TRUE, fill = TRUE)[[1]]

dfMetadata<-data.frame(colName = names(OutputDataSet), 
                          colType = sapply(OutputDataSet, class), 
                          row.names = NULL)
dfMetadataMutated <- dfMetadata %>% mutate(SqlDataType =  
          ifelse(colType == "character", "NVARCHAR(1024)", 
          ifelse(colType == "integer", "BIGINT",
          ifelse(colType == "numeric", "NUMERIC(35, 10)",
          NA))))
rResultSet <- paste("[", dfMetadataMutated$colName, "]", " ", 
      dfMetadataMutated$SqlDataType, 
      ", ", 
      sep="", 
      collapse="")
rResultSet <- substr(rResultSet, 1, nchar(rResultSet) - 2)
';
DECLARE @WithResultSets NVARCHAR(MAX);

EXEC sp_execute_external_script 
 @language = N'R',   
    @script = @Rscript,
 @output_data_1_name = N'none',
 @params = N'@rResultSet NVARCHAR(MAX) OUTPUT', 
 @rResultSet = @WithResultSets OUTPUT;
PRINT @WithResultSets;

DECLARE @Tsql NVARCHAR(MAX) = '
EXEC sp_execute_external_script 
 @language = N''R'',   
    @script = N''' + @Rscript + '''
 WITH RESULT SETS ((' + @WithResultSets + '));
';

EXEC (@Tsql);

Here is the SSMS output:

Dave Mason SQL Server Machine Learning Services sp_execute_external_script WITH RESULT SETS

In the end, I'd rather Microsoft give us a feature enhancement making this type of solution unnecessary. I have my doubts that this type of hack would be well received by other developers. It's a little convoluted, and the performance has got to suck since the R code has to run twice. Welcome to the world of T-SQL development.

2017-11-16

Machine Learning Services and Result Sets part 2

Dave Mason SQL Server Machine Learning Services WITH RESULT SETS R Python

In the last post, I pondered the use of WITH RESULT SETS along with sp_execute_external_script when we don't know the structure of the OutputDataSet until run time. Let's look at another example. Here I'll use the rvest package to scrape a web page: each HTML <table> found on the web page becomes a data frame in the "tables" list (line 5). The first data frame in the list is returned to SQL Server (line 6).

2017-11-15

Machine Learning Services and Result Sets

Dave Mason SQL Server Machine Learning Services WITH RESULT SETS R Python

When you run an R or Python script in-database via the sp_execute_external_script stored procedure, result sets returned by the stored procedure are output with unnamed columns by default. Here's an example using R and the [WideWorldImporters] database. The input data is a simple query on the Application.Countries table. R creates a data frame from the input data and merely returns it back to SQL Server (as the "OutputDataSet" data frame).

2017-11-14

T-SQL Tuesday #96: Folks Who Have Made a Difference

Dave Mason T-SQL Tuesday

Last week while reviewing some of my old posts, I realized I had written 99 of them. For my 100th post, I asked for suggestions from the Twitterverse. There were some really great responses, including some that focused on the journey to 100: what I learned, who helped and influenced me along the way, etc. I was going to go in another direction, but this really dovetails nicely with the T-SQL Tuesday topic for November 2017.

2017-11-09

The PASS Website and Social Media Promotion

I haven't been involved in the SQL community for very long. Because of that, I haven't done much volunteering. But earlier this year, Andy Warren asked me to help out with the organization efforts for SQLSaturday Orlando 2017. Andy is a veritable fountain of knowledge when it comes to SQLSaturdays. He has a ton of great ideas on how to improve things for everyone involved: attendees, sponsors, speakers, and organizers. Seriously, check out his recent posts to see what I mean!

2017-11-06

Multiple Output Datasets With R and SQL Server

Dave Mason SQL Server R

Something I've been fooling around with lately is using R to scrape web pages for HTML table data. As an example, there are two HTML tables on this post about unpivoting the output of a RESTORE HEADER command in T-SQL. Basically, there's a "wide" HTML table with dozens of columns, and a "narrow" HTML table with just two.

2017-10-10

XE Profiler - Initial Thoughts

SSMS 17.3 was recently released. There's a raft of new items in the release. That includes "XE Profiler", which is what I want to focus on.

When you open SSMS 17.3 and navigate to the Object Explorer window, at the bottom you'll see the new "XE Profiler" node. Expand it and there are two items: "Standard" and "TSQL". Navigate back up the tree to Management, expand it, find "Extended Events", expand it, then expand "Sessions". You should also see two new Extended Events sessions: "QuickSessionStandard" and "QuickSessionTSQL".

2017-10-05

Visual Studio: Create a Python Environment from SQL Server 2017 Machine Learning Services

Dave Mason SQL Server 2017 Visual Studio Python

I'm starting to experiment with Python scripts in SQL Server 2017 using Machine Learning Services (In-Database). The problem is, I don't know Python. If I run into a Python error, the output I get from SSMS is not looking too helpful. My instincts tell me I'll be much better off developing and debugging Python code from a development tool. What I settled on was to use Visual Studio along with the Python interpreter that comes with SQL Server 2017 Machine Learning Services. I ran into a few issues that I'll review here.

2017-10-02

Regular Expressions With R And T-SQL

Dave Mason SQL Server R RegEx

Have you ever had the need to use Regular Expressions directly in SQL Server? I sometimes hear or see others refer to using RegEx in TSQL. But I always assume they're talking about the TSQL LIKE operator, because RegEx isn't natively supported. In TSQL's defence, you can get a lot of mileage out of LIKE and some clever pattern matching strings, even though it's not authentic RegEx.

2017-09-29

The End Of Tanking?

NBA Draft Lottery

As you may have heard, the NBA Board of Governors approved a revision to the Lottery Format for 2019. Many reports on this activity are proclaiming the incentive to "tank" has been eradicated and that the tanking problem has been solved. This couldn't be farther from the truth.