2017-12-13

Machine Learning Services and Result Sets part 3

Dave Mason SQL Server Machine Learning Services sp_execute_external_script WITH RESULT SETS

To quickly recap the last couple of posts, the set of data returned by sp_execute_external_script has no column names by default. The column names (and data types) can be defined by using the WITH RESULT SETS options of the EXECUTE statement. But you have to know the result sets definition (number of columns, column names, and data types) at design time (before running the query). And that's a shame: there are column names for R and Python data frames, but SQL Server doesn't have a built-in way to access that metadata dynamically at runtime.

Perhaps in a future version of SQL Server, there will be a new parameter for WITH RESULTS SETS or a way to specify the result sets definition without having to hard code it. Until then, we have to do without, or get creative. Let's take a look at one possible solution.

Here's some R code that scrapes a web page and assigns data for the first HTML table into data frame df.

library(rvest)
library(dplyr)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
df <- html_table(hDoc, header = TRUE, fill = TRUE)[[1]]
head(df)
                     TEAM  W  L  PCT   GB   HOME   ROAD   CONF   PF   PA DIFF L-10 STRK
1         Houston Rockets 21  4 0.84    -  9 - 3 12 - 1 12 - 2 2876 2601  275 10-0   10
2          Boston Celtics 23  6 0.79    - 12 - 2 11 - 4 15 - 5 2994 2822  172  7-3    1
3   Golden State Warriors 22  6 0.79  0.5  9 - 3 13 - 3 10 - 4 3276 2968  308  9-1    7
4     Cleveland Cavaliers 20  8 0.71  2.5 11 - 4  9 - 4 16 - 6 3107 3008   99  9-1    2
5       San Antonio Spurs 19  9 0.68  3.5 13 - 2  6 - 7  9 - 5 2830 2731   99  8-2    1
6         Toronto Raptors 17  8 0.68  4.0  9 - 1  8 - 7  8 - 4 2763 2580  183  7-3    1

The "df" data frame columns have names and types:

library(rvest)
library(dplyr)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
df <- html_table(hDoc, header = TRUE, fill = TRUE)[[1]]

tableMetadata <- data.frame(colName = names(df),
                          colType = sapply(df, class),
                          row.names = NULL)
tableMetadata

   colName   colType
1     TEAM character
2        W   integer
3        L   integer
4      PCT   numeric
5       GB character
6     HOME character
7     ROAD character
8     CONF character
9       PF   integer
10      PA   integer
11    DIFF   integer
12    L-10 character
13    STRK   integer

We can take those names and R types, string them together, and "convert" them to SQL data types. (Mapping data types from one language to another is waaaay outside the scope of this post. Lines 11-13 are quick and dirty, just for demonstration purposes. Okie dokie?)

library(rvest)
library(dplyr)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
df <- html_table(hDoc, header = TRUE, fill = TRUE)[[1]]

tableMetadata <- data.frame(colName = names(df),
                          colType = sapply(df, class),
                          row.names = NULL)

tableMetadataMutated <- tableMetadata %>% mutate(SqlDataType =
    ifelse(colType == "character", "NVARCHAR(1024)",
    ifelse(colType == "integer", "BIGINT",
    ifelse(colType == "numeric", "NUMERIC(35, 10)",
    NA))))

rResultSet <- paste("[", tableMetadataMutated$colName, "]", " ",
    tableMetadataMutated$SqlDataType,
    ", ",
    sep = "",
    collapse = "")
rResultSet <- substr(rResultSet, 1, nchar(rResultSet) - 2)
rResultSet

[1] "[TEAM] NVARCHAR(1024), [W] BIGINT, [L] BIGINT, [PCT] NUMERIC(35, 10), [GB] NVARCHAR(1024), [HOME] NVARCHAR(1024), [ROAD] NVARCHAR(1024), [CONF] NVARCHAR(1024), [PF] BIGINT, [PA] BIGINT, [DIFF] BIGINT, [L-10] NVARCHAR(1024), [STRK] BIGINT"

When you look at the output, do you see where we're headed? Let's switch gears to T-SQL. Below is a script. Here are a few notes:

  • The R script code is assigned to the T-SQL @Rscript variable (lines 1-21).
  • At line 24, sp_execute_external_script is executed.
  • rResultSet in the R script will contain the string of column names and SQL data types (lines 15-20).
  • The value of rResultSet in R is passed to T-SQL variable @WithResultSets (line 29).
  • We don't want a dataset returned, so on line 27 we specify the data frame name "none", which doesn't exist in the R script.
  • At line 32 we declare a @Tsql variable to store our dynamically created T-SQL string.
  • The string builds an EXEC sp_execute_external_script statement (line 33).
  • The R script is reused (line 35).
  • WITH RESULT SETS is specified by way of the @WithResultSets T-SQL variable.
  • Lastly, the T-SQL string is executed (line 39).
  • This time, the @output_data_1_name is *not* specified. The R dataframe named OutputDataSet (line 5) is returned to SQL Server by default.

DECLARE @Rscript NVARCHAR(MAX) = '
library(rvest)
library(dplyr)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
OutputDataSet <- html_table(hDoc, header = TRUE, fill = TRUE)[[1]]

dfMetadata<-data.frame(colName = names(OutputDataSet), 
                          colType = sapply(OutputDataSet, class), 
                          row.names = NULL)
dfMetadataMutated <- dfMetadata %>% mutate(SqlDataType =  
          ifelse(colType == "character", "NVARCHAR(1024)", 
          ifelse(colType == "integer", "BIGINT",
          ifelse(colType == "numeric", "NUMERIC(35, 10)",
          NA))))
rResultSet <- paste("[", dfMetadataMutated$colName, "]", " ", 
      dfMetadataMutated$SqlDataType, 
      ", ", 
      sep="", 
      collapse="")
rResultSet <- substr(rResultSet, 1, nchar(rResultSet) - 2)
';
DECLARE @WithResultSets NVARCHAR(MAX);

EXEC sp_execute_external_script 
 @language = N'R',   
    @script = @Rscript,
 @output_data_1_name = N'none',
 @params = N'@rResultSet NVARCHAR(MAX) OUTPUT', 
 @rResultSet = @WithResultSets OUTPUT;
PRINT @WithResultSets;

DECLARE @Tsql NVARCHAR(MAX) = '
EXEC sp_execute_external_script 
 @language = N''R'',   
    @script = N''' + @Rscript + '''
 WITH RESULT SETS ((' + @WithResultSets + '));
';

EXEC (@Tsql);

Here is the SSMS output:

Dave Mason SQL Server Machine Learning Services sp_execute_external_script WITH RESULT SETS

In the end, I'd rather Microsoft give us a feature enhancement making this type of solution unnecessary. I have my doubts that this type of hack would be well received by other developers. It's a little convoluted, and the performance has got to suck since the R code has to run twice. Welcome to the world of T-SQL development.


SHARE

No comments :

Post a Comment