2017-11-16

Machine Learning Services and Result Sets part 2

Dave Mason SQL Server Machine Learning Services WITH RESULT SETS R Python

In the last post, I pondered the use of WITH RESULT SETS along with sp_execute_external_script when we don't know the structure of the OutputDataSet until run time. Let's look at another example. Here I'll use the rvest package to scrape a web page: each HTML <table> found on the web page becomes a data frame in the "tables" list (line 5). The first data frame in the list is returned to SQL Server (line 6).

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(rvest)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
tables <- html_table(hDoc, header = TRUE, fill = TRUE)
OutputDataSet <- tables[[1]]';


As expected, when we execute the code in SSMS, there are no column names in our output:


The lack of column names is disappointing. Let's see what happens when I run the equivalent code in Visual Studio (or RStudio):

library(rvest)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
tables <- html_table(hDoc, header = TRUE, fill = TRUE)
tables[[1]]
                     TEAM  W  L  PCT   GB  HOME  ROAD   CONF   PF   PA DIFF L-10 STRK
1          Boston Celtics 13  2 0.87    - 6 - 1 7 - 1  9 - 2 1540 1418  122 10-0   13
2   Golden State Warriors 11  3 0.79  1.5 6 - 2 5 - 1  6 - 2 1674 1508  166  9-1    7
3         Houston Rockets 11  4 0.73  2.0 4 - 3 7 - 1  5 - 2 1668 1569   99  7-3    1
4         Detroit Pistons 10  4 0.71  2.5 7 - 1 3 - 3  6 - 3 1480 1415   65  8-2    1
5         Toronto Raptors  9  5 0.64  3.5 4 - 1 5 - 4  3 - 2 1559 1481   78  7-3    2
6      Washington Wizards  9  5 0.64  3.5 5 - 3 4 - 2  5 - 1 1550 1470   80  6-4    4
7  Minnesota Timberwolves  9  5 0.64  3.5 5 - 1 4 - 4  7 - 3 1512 1525  -13  7-3    2
8       San Antonio Spurs  9  6 0.60  4.0 6 - 2 3 - 4  4 - 2 1522 1484   38  5-5    1
9          Denver Nuggets  8  6 0.57  4.5 6 - 2 2 - 4  2 - 3 1481 1467   14  7-3    1
10        New York Knicks  8  6 0.57  4.5 7 - 3 1 - 3  4 - 4 1484 1465   19  7-3    1

The columns in the data frame clearly have names, but SQL Server isn't using them. The data frame columns have types in R too (more on this in a moment). Now that makes me wonder about the data types for the data returned by SQL. How is that determined? If SQL isn't using the column names, can I assume it isn't making use of the R column types either?

For a point of reference, let's run some more R code to show the column names and types. As before, the rvest package is used to scrape a web page, with each HTML <table> found becoming a data frame in the "tables" list (line 3). A data frame of table metadata is created by calling data.frame(). The first parameter is a vector of column names (line 4), the second parameter is a vector of column classes (line 5), and the third parameter causes the row "names" to be incrementing digits (line 6).

UPDATE
I've updated this post with revised R code below. The previous version used an iterative/looping structure, which I've replaced with help from the sapply function. The resulting code is smaller, hopefully easier to read, and perhaps runs faster too.

library(rvest)
hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
tables <- html_table(hDoc, header = TRUE, fill = TRUE)
tableMetadata <- data.frame(colName = names(tables[[1]]),
                          colType = sapply(tables[[1]], class),
                          row.names = NULL)
tableMetadata
   colName   colType
1     TEAM character
2        W   integer
3        L   integer
4      PCT   numeric
5       GB character
6     HOME character
7     ROAD character
8     CONF character
9       PF   integer
10      PA   integer
11    DIFF   integer
12    L-10 character
13    STRK   integer

We can see the "PCT" column is a numeric R type, while all of the others are either character or integer types.

Now back to the SQL Server data types. Let's run a little bit of C# code to find out what SQL Server is returning to us. Here, we open a connection to SQL , execute the dbo.ScrapeWebPage stored procedure (for simplicity, I created dbo.ScrapeWebPage with hard-coded parameter values for sp_execute_external_script--it's on the third tab), iterate through the columns of the first record, and output the name and datatype of each column.

using System;
using System.Data.SqlClient;

namespace Rtesting
{
    class Program
    {
        static void Main(string[] args)
        {
            using (SqlConnection conn = GetSqlConnection())
            {
                using (SqlCommand cmd = new SqlCommand())
                {
                    //Run stored proc with no parameters.
                    cmd.Connection = conn;
                    cmd.CommandType = System.Data.CommandType.StoredProcedure;
                    cmd.CommandText = "dbo.ScrapeWebPage";

                    //Execute the SP.
                    using (SqlDataReader dr = cmd.ExecuteReader())
                    {
                        //Read the first row.
                        if (dr.Read())
                        {
                            //Iterate through the columns.
                            for(int i = 0; i < dr.FieldCount; i++)
                            {
                                string columnName = dr.GetName(i);
                                string columnType = dr.GetDataTypeName(i);

                                if (string.IsNullOrEmpty(columnName))
                                    columnName = "Not found";

                                Console.Write(columnName);
                                Console.WriteLine(columnType.PadLeft(15));
                            }
                        }
                    }
                }
            }
        }

        static SqlConnection GetSqlConnection()
        {
            SqlConnection conn = new SqlConnection();
            SqlConnectionStringBuilder csb = new SqlConnectionStringBuilder();

            csb.DataSource = ".\\Sql2017";
            csb.InitialCatalog = "Rtesting";
            csb.IntegratedSecurity = true;
            conn.ConnectionString = csb.ConnectionString;
            conn.Open();
            return conn;
        }
    }
}
Not found      nvarchar
Not found      int
Not found      int
Not found      float
Not found      nvarchar
Not found      nvarchar
Not found      nvarchar
Not found      nvarchar
Not found      int
Not found      int
Not found      int
Not found      nvarchar
Not found      int
CREATE OR ALTER PROCEDURE dbo.ScrapeWebPage
AS
 EXEC sp_execute_external_script 
  @language = N'R',
  @script = N'library(rvest)
 #library(stringr)
 hDoc <- read_html("https://www.usatoday.com/sports/nba/standings/2017/league/")
 tables <- html_table(hDoc, header = TRUE, fill = TRUE)
 OutputDataSet <- tables[[1]]';
GO

I had wondered if SQL was just throwing everything back at me as NVARCHAR, but we can see that's not the case. The R numeric type is returned as a SQL float, R character types as SQL nvarchar, and R integer types AS SQL int. Are the R types getting mapped directly to SQL Server data types? Or is that a coincidence? Maybe SQL is simply examining the data values and determining column data types from that. Either way, we are getting some semblance of proper data types returned back even without specifying WITH RESULT SETS. The data types may not be the best choices. For instance, all of the INT types could have been SMALLINT or even TINYINT. And I would have preferred DECIMAL/NUMERIC to FLOAT. Still, it's a decent effort. But we're still missing those column names, which are available on the R side of the equation.

This post ended being much longer than I anticipated. In my next post, I'll wrap things up and show my solution for getting column names and data types dynamically at run time. Until then, happy coding!


SHARE

No comments :

Post a Comment