2017-11-06

Multiple Output Datasets With R and SQL Server

Dave Mason SQL Server R

Something I've been fooling around with lately is using R to scrape web pages for HTML table data. As an example, there are two HTML tables on this post about unpivoting the output of a RESTORE HEADER command in T-SQL. Basically, there's a "wide" HTML table with dozens of columns, and a "narrow" HTML table with just two.

Here is some initial R code I came up with (and the output) to nab the HTML table data as data frames. Keep in mind I am still very new to R. What I've written may be "clumsy or random". Plus, it's been brought to my attention that the rvest package is superior to the methods I've used. But the web scraping isn't what inspired this post.


#install.packages('XML')
#install.packages('RCurl')

library(RCurl)
library(XML)
url <- 'https://itsalljustelectrons.blogspot.com/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'
tables = getURL(url)
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F)
tables

$HO
                                      BackupName BackupDescription BackupType
1 SomeApp9_DatabaseTemplate-Full Database Backup              NULL          1
  ExpirationDate Compressed Position DeviceType             UserName
1           NULL          0        1          2 SomeVendor\\SRHadden
              ServerName              DatabaseName DatabaseVersion
1 SomeSqlHost\\SQL2008R2 SomeApp9_DatabaseTemplate             661
     DatabaseCreationDate BackupSize             FirstLSN              LastLSN
1 2014-01-02 09:13:15.000   31781888 10993000000463100036 10993000000464800001
         CheckpointLSN    DatabaseBackupLSN         BackupStartDate
1 10993000000463100036 10993000000413800256 2015-09-14 13:18:44.000
         BackupFinishDate SortOrder CodePage UnicodeLocaleId
1 2015-09-14 13:18:46.000        52        0            1033
  UnicodeComparisonStyle CompatibilityLevel SoftwareVendorId
1                 196609                 90             4608
  SoftwareVersionMajor SoftwareVersionMinor SoftwareVersionBuild MachineName
1                   10                   50                 1617 SomeSqlHost
  Flags                            BindingID
1   512 62542A27-94A4-4010-905A-32AEFA4016AB
                        RecoveryForkID                    Collation
1 FC58F82C-0D2C-4D7F-9225-3DF679BBF636 SQL_Latin1_General_CP1_CI_AS
                            FamilyGUID HasBulkLoggedData IsSnapshot IsReadOnly
1 25E2953C-BAAA-4A41-9E08-B0E4D82B7088                 0          0          0
  IsSingleUser HasBackupChecksums IsDamaged BeginsLogChain HasIncompleteMetaData
1            0                  0         0              0                     0
  IsForceOffline IsCopyOnly                  FirstRecoveryForkID ForkPointLSN
1              0          0 FC58F82C-0D2C-4D7F-9225-3DF679BBF636         NULL
  RecoveryModel DifferentialBaseLSN DifferentialBaseGUID BackupTypeDescription
1        SIMPLE                NULL                 NULL              Database
                         BackupSetGUID CompressedBackupSize Containment
1 AC2D9342-692E-4D05-B321-16274E6040C3             31781888           0

$HO
                    Field                                          Value
1              BackupName SomeApp9_DatabaseTemplate-Full Database Backup
2       BackupDescription                                           NULL
3              BackupType                                              1
4          ExpirationDate                                           NULL
5              Compressed                                              0
6                Position                                              1
7              DeviceType                                              2
8                UserName                           SomeVendor\\SRHadden
9              ServerName                         SomeSqlHost\\SQL2008R2
10           DatabaseName                      SomeApp9_DatabaseTemplate
11        DatabaseVersion                                            661
12   DatabaseCreationDate                            Jan  2 2014  9:13AM
13             BackupSize                                       31781888
14               FirstLSN                           10993000000463100036
15                LastLSN                           10993000000464800001
16          CheckpointLSN                           10993000000463100036
17      DatabaseBackupLSN                           10993000000413800256
18        BackupStartDate                            Sep 14 2015  1:18PM
19       BackupFinishDate                            Sep 14 2015  1:18PM
20              SortOrder                                             52
21               CodePage                                              0
22        UnicodeLocaleId                                           1033
23 UnicodeComparisonStyle                                         196609
24     CompatibilityLevel                                             90
25       SoftwareVendorId                                           4608
26   SoftwareVersionMajor                                             10
27   SoftwareVersionMinor                                             50
28   SoftwareVersionBuild                                           1617
29            MachineName                                    SomeSqlHost
30                  Flags                                            512
31              BindingID           62542A27-94A4-4010-905A-32AEFA4016AB
32         RecoveryForkID           FC58F82C-0D2C-4D7F-9225-3DF679BBF636
33              Collation                   SQL_Latin1_General_CP1_CI_AS
34             FamilyGUID           25E2953C-BAAA-4A41-9E08-B0E4D82B7088
35      HasBulkLoggedData                                              0
36             IsSnapshot                                              0
37             IsReadOnly                                              0
38           IsSingleUser                                              0
39     HasBackupChecksums                                              0
40              IsDamaged                                              0
41         BeginsLogChain                                              0
42  HasIncompleteMetaData                                              0
43         IsForceOffline                                              0
44             IsCopyOnly                                              0
45    FirstRecoveryForkID           FC58F82C-0D2C-4D7F-9225-3DF679BBF636
46           ForkPointLSN                                           NULL
47          RecoveryModel                                         SIMPLE
48    DifferentialBaseLSN                                           NULL
49   DifferentialBaseGUID                                           NULL
50  BackupTypeDescription                                       Database
51          BackupSetGUID           AC2D9342-692E-4D05-B321-16274E6040C3
52   CompressedBackupSize                                       31781888
53            containment                                              0

$`NULL`
NULL

$`NULL`
  CONNECT

Ok, so the rookie web scraping R code notwithstanding, I've got the beginnings of something here (I think). Now what? Getting the output data from R into a SQL Server table sounds like a logical next step. But I've got a couple of issues.

The first is that I was expecting two HTML tables, but there are actually four of them. HTML table #3 is null--it's probably the container for the "SHARE" buttons towards the bottom of the page, above the comments. And HTML table #4, while not null, consists of one "empty" row--it's the "CONNECT" area in the upper-right part of the page. Hey! While you're checking out those two sections, feel free to share this article or connect with me on Twitter or LinkedIn! 😊 By the way, you likely won't see those sections unless you're using a desktop browser. Mobile devices render the page differently.

The second issue is that sp_execute_external_script returns a single data frame as output. The R code returns a list (or vector?) that may contain multiple data frames. I could just return the first one. But that kinda sucks. How can I grab them all?

Let's start by only returning one data frame. I can change line #9 above to return just the first item in the list:

tables[[1]]


Now I can run that R code in SQL Server as follows:

--Return first html table.
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- ''https://itsalljustelectrons.blogspot.com/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'';
tables = getURL(url);
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F);
OutputDataSet <- tables[[1]]'
WITH RESULT SETS UNDEFINED;
Here is the partial output in SSMS:
Dave Mason SQL Server R sp_execute_external_script


I could also run sp_execute_external_script multiple times--once for each data frame. Here's an example with data frame #1 and #2:

--Return first html table.
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- ''https://itsalljustelectrons.blogspot.com/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'';
tables = getURL(url);
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F);
OutputDataSet <- tables[[1]]'
WITH RESULT SETS UNDEFINED;

--Return second html table.
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- ''https://itsalljustelectrons.blogspot.com/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'';
tables = getURL(url);
tables <- readHTMLTable(tables, stringsAsFactors = F);
OutputDataSet <- tables[[2]]'
WITH RESULT SETS UNDEFINED;
Again, here is the partial output in SSMS:
Dave Mason SQL Server R sp_execute_external_script


Of course for this strategy to work, I'd have to know ahead of time how many data frames/HTML tables there are. Hmmm. Can dynamic T-SQL help me here? If I could find out at run time how many data frames there are, and which ones I may or may not want, then why not? Here's some R code that reads HTML tables into a variable as a list of data frames(line 8), iterates through the list (starting at line 18), decides if the HTML table has any data in it (lines 21, 24), and adds the HTML table number (the element number in the list) to a different data frame (line 27). The output shows us we would want HTML tables 1, 2, and 4. (Yeah, I really didn't want #4. But that can be fixed by enhancing the R code to be more selective. Let's just go with it for now.)


#install.packages('XML')
#install.packages('RCurl')

library(RCurl)
library(XML)
url <- 'https://itsalljustelectrons.blogspot.com/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'
tables = getURL(url)
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F)
#tables

#Create an "empty" data frame, with a single "column" of numbers.
retDataFrameIndexes <- data.frame(rowIndex = numeric())

#Were any html tables found?
if (length(tables) > 0) {
 #Iterate through list.
 for (i in 1:length(tables)) {
  #Make sure the data frame isn't null...
  if (!is.null(tables[[i]])) {
   #...and that it has rows.
   if (nrow(tables[[i]]) > 0) {
    #Add a "row" to our data frame.
    retDataFrameIndexes <- rbind(retDataFrameIndexes, data.frame(rowIndex = i))
   }
  }
 }
}

retDataFrameIndexes

> retDataFrameIndexes
  rowIndex
1        1
2        2
3        4


We can take the script above, run it in SQL Server via sp_execute_external_script, pass the output back to SQL Server as the OutputDataSet, and capture the results into a temp table. Still with me? Good! Next, we'll create a cursor, iterate through the temp table of HTML table numbers, and execute sp_execute_external_script, using dynamic T-SQL to specify which data frame/HTML table number we want.

CREATE TABLE #HtmlTables (RowIndex INT);
INSERT INTO #HtmlTables 
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'#install.packages(''XML'')
#install.packages(''RCurl'')

library(RCurl)
library(XML)
url <- ''https://itsalljustelectrons.blogspot.com/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html''
tables = getURL(url)
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F)

#Create an "empty" data frame, with a single "column" of numbers.
OutputDataSet <- data.frame(rowIndex = numeric())

#Were any html tables found?
if(length(tables)>0)
{
  #Iterate through list.
  for(i in 1:length(tables))
  { 
    #Make sure the data frame isn''t null...
    if(!is.null(tables[[i]]))
    {
      #...and that it has rows.
      if(nrow(tables[[i]]) > 0)
      {
        #Add a "row" to our data frame.
        OutputDataSet <- rbind(OutputDataSet, data.frame(rowIndex = i))
      }
    }
  }
}'

DECLARE @RowIdx INT;
DECLARE curRowIdx CURSOR FAST_FORWARD READ_ONLY FOR
 SELECT RowIndex FROM #HtmlTables;

OPEN curRowIdx;
FETCH NEXT FROM curRowIdx INTO @RowIdx;

WHILE (@@FETCH_STATUS = 0)
BEGIN
 DECLARE @RScript NVARCHAR(MAX) = N'library(RCurl)
library(XML)
url <- ''https://itsalljustelectrons.blogspot.com/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'';
tables = getURL(url);
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F);
OutputDataSet <- tables[[' + CAST(@RowIdx AS NVARCHAR) + ']]';

 EXEC sp_execute_external_script 
  @language = N'R',
  @script = @RScript
  WITH RESULT SETS UNDEFINED;

 FETCH NEXT FROM curRowIdx INTO @RowIdx;
END

CLOSE curRowIdx;
DEALLOCATE curRowIdx;
DROP TABLE #HtmlTables;

Here's the output from SSMS:

Dave Mason SQL Server R sp_execute_external_script

So there you have it. Multiple output datasets with R and SQL Server. All it took was R (natch), sp_execute_external_script running essentially the same script multiple times, some dynamic T-SQL, a temp table, and a cursor. After all of that, I think I want to throw up. This was a long post--thanks for staying with me to the end. As always, happy coding!

UPDATE
I thoughtlessly forgot to credit some friends on Slack. Thank you for your input and assistance!

  • Steph Locke
  • Kevin Feasel
  • John G Hohengarten

  • SHARE

    No comments :

    Post a Comment

    Subscribe