NBA Games Are Harder Than Ever To Watch

The 2020-2021 NBA season hasn't been as fulfilling as I had hoped. It's not just because my beloved Boston Celtics aren't playing as well as I expected. It's because of the in-game camera work. I noticed things started getting weird last season during the playoffs. And it's gotten weirder and weirder since then.


Traditional Camera Angle

Here's an example of the classic camera angle that has always worked well. You can generally see all five players on each team, frequently from head to toe. The entirety of the half-court is visible. Even Grant Williams, waiting patiently in the lower left for an open "corner three," is fully visible. This works. It is perfection.


Low Camera Angles

I think ABC and/or ESPN is to blame for this. With the low camera angle that seems to be in vogue, there is still some elevation--the camera isn't ground-level. But it's significantly lower than the camera height of the traditional angle. Players who are far apart can seem bunched up. Three players for the Wizards (labeled 1, 2, 3) look like a pile of bodies. Note that two of them are on opposite sides of the lane, which is 16 feet wide. Aaron Nesmith for the Celtics (labeled 4) might be mistaken for a player standing up from the bench.

NBA Camera Angles - Too Low

Here's another example. Again we see two players standing significantly apart (players 2 and 3 for Washington). One of them is badly obscured by the other. Making matters worse, the fifth Celtic can't even be seen. What you can't tell from this camera angle is that Carsen Edwards is standing behind the three-point line near the baseline. This looks like five on four. It is inexcusable.

NBA Camera Angles - Too Low

The low camera angle stinks in the half court. And it stinks when teams are in transition too. Here you can only see two players from each team.

NBA Camera Angles - Too Low

Other Strange Angles

It's not just the low camera angle. There are other questionable choices with the camera work. Here's what was shown when Grant Williams inbounded the ball from the sideline in the front court. You can only see part of the court, several players are off camera, and you can't see the basket at all.

NBA Camera Angles - Inbound

This is essentially the same thing, but with a bird's-eye view. You still can't see all of the players or the basket.

NBA Camera Angles - Inbound

Who is Semi Ojeleye inbounding the ball to?

NBA Camera Angles - Inbound

Here we get an odd camera angle for a jump ball. That's 7' 6" Tacko Fall (one of the 40 tallest people in the world) for Boston and 6' 6" Troy Brown Jr. for Washington. They look the same size.

NBA Camera Angles - Jump Ball

This is another bad angle. Pop quiz: who is shooting this free throw?

NBA Camera Angles - Free Throw

There's an abundance of bad camera angles for free throws. I counted around ten in the game these images are from. It's too much.


Closing Thoughts

If I had the ear of those responsible, I'd want them to know they're trying too hard. I'd ask them to take to heart the old adage KISS: keep it simple, stupid. Stick with the traditional camera angles.


T-SQL Tuesday #135: Outstanding Tools

Dave Mason T-SQL Tuesday

For T-SQL Tuesday #135, Mikey Bronowski (b|t) asks us about the tools we use. What are the most helpful and effective tools you use or know of? Great questions, Mikey! I've got a few in mind for sure. And although I usually blog about SQL Server or something data-related, these are more general-purpose tools...at least for us Windows folks.


Notepad

Yep, Notepad. I use it for all sorts of things. Whether it's code files, configuration/settings files, files for third-party apps, or just plain old text/log files, Notepad is my go-to plain text viewer and editor. Want to level up your Notepad skills? Create a shortcut to the Notepad.exe file and drop it in the "SendTo" folder. Now you can open any file with Notepad easily from Windows Explorer. Neat!


Paint.NET

I'm a proponent of "Show me, don't tell me". Screen captures go a long way toward that. Sure, Windows has Paint, but Paint.NET is a bit more advanced. I almost always have an instance of it open in the background. Hit the Print Screen keyboard button and paste (CTRL + V) it into Paint.NET (you can also use the ALT + Print Screen key combination to screen capture just the active window). From there you can do anything: trim down to just a specific part of the image, add some red arrows or circles for emphasis, blur/obscure any sensitive data that's in the image, etc. I take tweaked screen shots and paste them into just about anything...Word documents, email, even Twitter.


SpaceSniffer

Running low on disk space and having a hard time figuring out where all your free storage went? There are numerous utilities that can help out. My favorite is SpaceSniffer. There's a zipped download that doesn't require a typical setup.exe installation. This makes it very convenient to copy to other machines you may be responsible for.


KeePass

If you're not using a password manager, I strongly suggest it. There are many good ones out there with different features and pros & cons. I've settled on KeePass. But if it's not KeePass, please use something!


7-Zip

Windows' built-in zip/compression functionality does what I need most of the time. If I need to password protect a zipped file, though, 7-Zip is the tool I use. It also has many additional features above and beyond what Windows provides (archive format, compression level, split to volumes, etc.).


Remote Desktop Connection Manager

If you find yourself connecting to a large number of virtual machines, managing those connections can be a chore. Before using Remote Desktop Connection Manager (aka "RDCMan"), I used to open the built-in Windows Remote Desktop Connection tool, hope the computer I wanted to connect to was in the drop-down list of computers, and manually enter my credentials. I got fancy and started using a DOS batch file for each computer I connected to regularly. But I still had to manually enter credentials. Enter Remote Desktop Connection Manager: it lets you create and organize groups of remote desktop connections. It handles credentials and a few other things too. Unfortunately, a vulnerability in the tool was discovered, and Microsoft decided to retire it. If you have a different one to suggest, please drop me a note in the comments!


Headphones

The pandemic has resulted in many of us working from home. I'm sure we've all struggled during online meetings with poor audio and feedback that makes it hard for everyone to hear each other. Headphones are immensely better than using the built-in speakers and microphone of a laptop or desktop computer. The headset I've been using for a few years is a Logitech model. It was cheap too--only about $30 (US). The microphone arm can be adjusted to different positions, and you can even flip it up and entirely out of the way if you just want to use the headset to listen to music. There's a physical mute button that is very handy, along with volume controls. It's USB too--I didn't have to install anything on Windows. Audiophiles may turn their noses up at this model. You can spend more (a lot more!) on a microphone/headset setup that arguably sounds better. But compared to everyone else without a headset, you'll be playing chess while they're playing checkers.


Microsoft Certifications: an Evolving Stance

Dave Mason Microsoft Certification

I've been a longtime proponent of certifications. This is not a majority position--there are many who feel certs are worthless pieces of paper. But in my experience, the process of preparing for an exam is an unmatched way to learn. Still, I haven't earned a ton of certifications. And sometimes I go for long stretches without pursuing one. In fact, it's been more than two years since my last cert. Why have I waited so long? In a word, Azure.

My preferred method to prep for a certification is a Microsoft Press book. They create the test, so it makes sense to use their study guide. Microsoft's new role-based certifications are Azure-centric. For those, there's not much out there. One of the explanations I had heard was that the Azure ecosystem changes so rapidly that by the time a book was published, a significant amount of information would be outdated. So Microsoft isn't in that business much anymore (if at all). This makes sense. I've noted similar observations from numerous speakers in the SQL Server community--the slide decks and other content they meticulously pore over and compile quickly become outdated, forcing them to frequently update their presentations.

In turn, I think it's fair to say the current role-based Azure certification exams are moving targets. This poses two problems. One, how do I know I'm studying (and learning!) the right material? And two, how long after earning a cert will the material I learned be irrelevant?

I've heard of some people who can prep for a test in a few weeks or less. I am not one of those people. It usually takes me three to four months to prepare. I don't think that cadence works anymore. As for irrelevancy, it's been exhausting trying to convince much of the IT world that certifications have value. A quickly outdated Azure cert makes it even harder to push that boulder up the hill.


Unpopular Opinion

If you have pearls nearby, now would be a good time to clutch them: I don't like Azure. Oh, hey...you're still here. Thanks for sticking around! The thing is, I don't dislike it, either. I just don't use Azure for much of anything. From what I've gathered, the lion's share of work is still happening on premises. But you wouldn't know it by looking at Microsoft's latest certification offerings.

It's not just SQL Server either. Windows certifications are gone too. C# and .NET certifications for developers are suffering a similar fate. Microsoft's focus is clearly in the cloud. And that's ok. But what is to become of their flagship relational database management system, operating system, development platform, and programming language? In the absence of information from Redmond, all we can do is speculate. Without knowing anything about the path forward, my thoughts have turned gloomy.


Counterpoint

My hesitancy to go all in with Azure has led me to ask myself a few questions. Am I being stubborn? Am I being lazy? Am I resistant to change? I don't rule any of those out. Those questions warrant a solid "maybe". I've got fewer years in my career ahead of me than behind, though. I need to be sensible about where I invest my time and energy. I'd also point out I haven't stopped learning, having recently finished two edX courses (Data Science: R Basics and Data Science: Visualization).

Thinking back to my first certification experience (for Oracle), it didn't benefit me directly. I don't have regrets--I learned a lot about database theory, development, and infrastructure. But if I had it to do over again, I'd have chosen a different path. I'll be watching the Microsoft cert landscape, hoping there will be new offerings for the on-premises crowd. Until then, my instincts tell me there are better paths for me than Azure.


Adding Custom Transact-SQL Code Snippets

Dave Mason - SQL Server Management Studio

I have a number of scripts and queries I've written and curated over the years. They're not organized that well, scattered across different folders. Some are named poorly or grouped in a questionable manner. There are a handful that I tend to use the most. And yet with that small number, I sometimes have difficulty quickly finding a particular script (if I can find it at all), or I burn too many mouse clicks getting to it. It dawned on me recently to make use of code snippets.

Code snippets may have been intended primarily to aid code writing, but they can assist with administrative tasks too. Regardless of purpose, SSMS supports creating your own snippets, and we can create our first snippet in three steps:

  1. Copy an existing code snippet file.
  2. Edit it.
  3. Import it into SSMS.

Step 1

To get started, open the Code Snippets Manager from within SSMS via the Tools menu.

Dave Mason - SSMS Code Snippets Manager

Note the path where the built-in "Function" snippets are stored, and then navigate to that path in Windows Explorer (you can highlight the path with your mouse and copy it, if desired).

Dave Mason - Code Snippets Manager

Find the "Begin End.snippet" file, and copy it. Create a "My Snippets" folder in your "My Documents" folder, paste the file into it, and rename the file to "Last Restart.snippet".


Step 2

Next, open the file in a text editor. I use Visual Studio 2019, but any editor will do. You'll probably notice the snippet file consists of XML data. Here's what it looks like:

<?xml version="1.0" encoding="utf-8" ?>
<CodeSnippets  xmlns="http://schemas.microsoft.com/VisualStudio/2005/CodeSnippet">
  <_locDefinition xmlns="urn:locstudio">
    <_locDefault _loc="locNone" />
    <_locTag _loc="locData">Title</_locTag>
    <_locTag _loc="locData">Description</_locTag>
    <_locTag _loc="locData">Author</_locTag>
    <_locTag _loc="locData">ToolTip</_locTag>
  </_locDefinition>
  <CodeSnippet Format="1.0.0">
    <Header>
      <Title>Begin</Title>
      <Shortcut></Shortcut>
      <Description>Code Snippet for Begin...End block.</Description>
      <Author>Microsoft Corporation</Author>
      <SnippetTypes>
        <SnippetType>SurroundsWith</SnippetType>
      </SnippetTypes>
    </Header>
    <Snippet>
      <Declarations></Declarations>
      <Code Language="SQL">
        <![CDATA[
BEGIN

$selected$ $end$

END
]]>
      </Code>
    </Snippet>
  </CodeSnippet>
</CodeSnippets>

Now we'll edit the file to create a "Last Restart" code snippet that inserts this query into an SSMS code tab/window:

SELECT i.sqlserver_start_time
FROM sys.dm_os_sys_info i;

The values for the Title, Description, and Author nodes become "Last Restart", "Query that returns the date and time SQL was last started.", and "Dave Mason @BeginTry", respectively. The SnippetTypes node is not needed--delete it. The code for our snippet query goes in the Code node, within the CDATA section. Here is the end result:

<?xml version="1.0" encoding="utf-8" ?>
<CodeSnippets  xmlns="http://schemas.microsoft.com/VisualStudio/2005/CodeSnippet">
  <_locDefinition xmlns="urn:locstudio">
    <_locDefault _loc="locNone" />
    <_locTag _loc="locData">Title</_locTag>
    <_locTag _loc="locData">Description</_locTag>
    <_locTag _loc="locData">Author</_locTag>
    <_locTag _loc="locData">ToolTip</_locTag>
  </_locDefinition>
  <CodeSnippet Format="1.0.0">
    <Header>
      <Title>Last Restart</Title>
      <Shortcut></Shortcut>
      <Description>Query that returns the date and time SQL was last started.</Description>
      <Author>Dave Mason @BeginTry</Author>
    </Header>
    <Snippet>
      <Code Language="SQL">
        <![CDATA[SELECT i.sqlserver_start_time
FROM sys.dm_os_sys_info i;]]>
      </Code>
    </Snippet>
  </CodeSnippet>
</CodeSnippets>


Step 3

Lastly, we'll import the code snippet to SSMS, so return to Code Snippets Manager. I had some trouble importing *.snippet files into the default "My Code Snippets" category/folder (perhaps a folder permission issue?). I removed "My Code Snippets", clicked the "Add..." button, and navigated to and selected the "My Snippets" folder I had previously created within "My Documents". Expanding "My Snippets" reveals the new snippet that was created.

Dave Mason - Code Snippets Manager

Using The New Snippet

Click OK to close Code Snippets Manager (if it's still open) and open a new SSMS code tab/window.

  1. Right-click anywhere and select "Insert Snippet...".
  2. A drop-down list of snippet categories appears. Double-click the "My Snippets" category folder.
  3. The "Last Restart" code snippet will be available (note the Tooltip). Double-click "Last Restart".
  4. The query from our snippet appears in the SSMS window.

Dave Mason - SSMS Insert Snippet
Dave Mason - SSMS Snippet Categories
Dave Mason - Last SQL Server Restart
Dave Mason - Last SQL Server Restart

Additional Snippets

If you want to add more snippets to SSMS, create a new *.snippet file, edit it, and copy it into the existing folder with "Last Restart.snippet". If the new snippet adheres to the Code Snippets Schema, it will show up automatically in SSMS.

I've created some other snippets that are available in this GitHub repository: BeginTry/SSMS-Code-Snippets. Many of them include replacement points, which you may find especially useful. Happy coding, everyone.


Transact-SQL Code Snippets

There are certain T-SQL statements whose syntax I have trouble remembering. When those situations arise, I might look up the syntax online; find the same type of object in SSMS, right-click it, script out the object, and use that as a starting point; or find one of my own scripts that has the syntax I'm looking for. Another option that I often overlook is T-SQL code snippets.

What is a code snippet? Here's the definition from the Microsoft documentation:

A Transact-SQL code snippet is a template containing the basic structure of a Transact-SQL statement or block. You can use snippets as a starting point when adding statements in the Database Engine Query Editor. You can insert the pre-defined snippets supplied with SQL Server, or create your own.

Here's what it looks like in SSMS:

  1. Right-click in an SSMS window and click "Insert Snippet..."
  2. A drop-down list of snippet categories appears. Double-click one of the category folders to see the snippets in that group.
  3. A drop-down list of code snippets appears. Hover your mouse over any of them to see the snippet definition tooltip. Double-click one to insert the code to the SSMS window.
  4. If the snippet has replacement points (highlighted in yellow), the TAB key will cycle through them allowing you to conveniently replace each with the values you want. If you rest your mouse pointer on a replacement point, a tooltip appears with a description.

Dave Mason - SSMS - Insert Snippet
Dave Mason - Snippet Categories
Dave Mason - Code Snippets
Dave Mason - Code Snippet Replacement Points

SSMS has pre-defined snippets for the following categories:

  1. Function
  2. Index
  3. Login
  4. Role
  5. Schema
  6. Stored Procedure
  7. Synonym
  8. Table
  9. Trigger
  10. User
  11. User Defined Data Type
  12. User Defined Table Type
  13. User Defined Type
  14. View

Check 'em out. And happy coding!

(Would you like to create your own, custom code snippets? Continue reading...)


Columnstore For Real-Time Operational Analytics

SQL Server Columnstore - Dave Mason

Starting with SQL Server 2016, new columnstore indexing features allow something called real-time operational analytics. Microsoft's documentation describes it as "the ability to run both analytics and OLTP workloads on the same database tables at the same time". Additionally, it can "eliminate the need for ETL and a data warehouse".

I tend to work mostly with OLTP environments. Many of them have questionable designs or serve reporting workloads. Not surprisingly, there are a lot of performance-sapping table scans and index scans. I've compensated for this somewhat by using row and page compression, which became available on all editions of SQL Server starting with SQL Server 2016 SP1. Could I get even better results with columnstore indexes? Let's look at one example.
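
As an aside, enabling page compression on an existing rowstore index--the approach mentioned above--is a single ALTER INDEX statement. Here's a minimal sketch; the table and index names are hypothetical:

--Hypothetical names--substitute your own table and index.
ALTER INDEX IX_SalesDetail_OrderDate
ON dbo.SalesDetail
REBUILD WITH (DATA_COMPRESSION = PAGE);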

Here are four individual query statements from a stored procedure used to get data for a dashboard. If you add up percentages for Estimated Cost (CPU + IO), Estimated CPU Cost, or Estimated IO Cost, you get a total of about 90% (give or take a few percent).

SQL Server Columnstore - Dave Mason

The first two statements include a nonclustered index scan, and the second two both have a clustered index scan. Each of the four scans consumes a large portion of the CPU and/or IO. Two nonclustered columnstore indexes were added. Now let's take a look at the before and after images of the query plans.
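
Before diving into the plans, here's roughly what adding one of those columnstore indexes looks like. This is a minimal sketch with hypothetical table and column names, not the actual indexes from this environment:

--Hypothetical table and columns for illustration only.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_DashboardFacts
ON dbo.DashboardFacts (CustomerID, OrderDate, Quantity, Amount);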


Statement 1

Index scan and index seek operators are both replaced by columnstore index scans. Note the table insert operation--presumably, its overall cost would be consistent. Yet as a percentage, it goes up significantly, becoming more "expensive" by virtue of the other operations being more efficient and less "expensive".

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


Statement 2

As in the previous statement, index scan and index seek operators are both replaced by columnstore index scans. We also see the table insert operation increase in cost.

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


Statements 3 and 4

The final two statements are nearly identical, so I won't repeat them. It's a different table this time, and it has a clustered index seek and a clustered index scan. Both operations are replaced by columnstore index scans.

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


With the nonclustered columnstore indexes in place, the relative costs of the four individual query statements go down significantly. Summed up, Estimated Cost (CPU + IO) is down to 58%, Estimated CPU Cost is down to 66%, and Estimated IO Cost is down to 55%.

SQL Server Columnstore - Dave Mason


The execution stats look even better. Elapsed time for the stored procedure drops from 15 seconds to 5 seconds. Logical reads also dropped by 80%. And there's even more good news. By keeping a close eye on index usage stats, I noticed some of the other rowstore indexes on the two tables were no longer being used for index scans. Various other execution plans were now using the new nonclustered columnstore index for those scans instead. Many of those plans saw similar performance gains with reduced CPU and IO.
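
If you'd like to do that kind of monitoring yourself, a query along these lines against sys.dm_db_index_usage_stats is a reasonable starting point (a generic sketch, not the exact query used here):

--Seeks, scans, and lookups per index in the current database.
--Indexes with no row in the DMV haven't been read or updated since the last restart.
SELECT o.name AS TableName, i.name AS IndexName, i.type_desc,
 s.user_seeks, s.user_scans, s.user_lookups, s.user_updates,
 s.last_user_seek, s.last_user_scan
FROM sys.indexes i
INNER JOIN sys.objects o ON o.object_id = i.object_id
LEFT JOIN sys.dm_db_index_usage_stats s
 ON s.object_id = i.object_id
 AND s.index_id = i.index_id
 AND s.database_id = DB_ID()
WHERE o.is_ms_shipped = 0
ORDER BY o.name, i.name;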

I am still somewhat new to columnstore indexes, and my sample size is admittedly small. But so far, real-time operational analytics appears to be the best of both worlds. Reporting and analytical queries that scan wide swaths of data are working well with columnstore indexes, while existing rowstore indexes provide traditionally efficient seeks. Are you using nonclustered columnstore indexes on OLTP tables too? Please leave a comment! I haven't found anyone else who's doing this yet. Happy indexing, everyone.


Scraping Web Pages - Part 2

Dave Mason - RCurl XML

In the last post, we looked at a way to scrape HTML table data from web pages and save the data to a table in SQL Server. One of the drawbacks is the need to know the schema of the data that gets scraped--you need a SQL Server table to store the data, after all. Another shortcoming is that if there are multiple HTML tables, you need to identify which one(s) you want to save.

For this post, we'll revisit web scraping with Machine Learning Services and R. This time, we'll take a schema-less approach that returns JSON data. As before, this web page will be scraped: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players, the second is a listing of games played during the regular season.


R Code

This code uses the RCurl and XML packages (as was the case in the last post), plus the jsonlite package. There's a bit more code this time. Hopefully the code comments are helpful:

library(RCurl)
library(XML)
library(jsonlite)

#type.convert - sets appropriate class for each data frame variable/column.
TypeConvertDataFrameList <- function(lsDataFrame) {
    lapply(lsDataFrame, type.convert, as.is = TRUE)
}

url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html"
html = RCurl::getURL(url)

dfHtmlTableData <- XML::readHTMLTable(html, header = TRUE,
                                 as.data.frame = TRUE,
                                 stringsAsFactors = FALSE)
#remove null elements from list.
dfHtmlTableData <- dfHtmlTableData[!sapply(dfHtmlTableData, is.null)]

#convert data frame variables to appropriate classes.
dfHtmlTableData <- lapply(dfHtmlTableData, TypeConvertDataFrameList)

dfTables <- data.frame(url = rep(c(url), each = length(dfHtmlTableData)))

#Empty vector for JSON string data.
jsonStringsFactor <- rep(c(""), each = length(dfHtmlTableData))

if (length(dfHtmlTableData) > 0) {
    #Convert each HTML table to JSON.
    for (i in 1:length(dfHtmlTableData)) {
        jsonStringsFactor[i] <- jsonlite::toJSON(
      as.data.frame(dfHtmlTableData[[i]]), auto_unbox = TRUE)
    }
}

dfTables$HtmlTableAsJson <- jsonStringsFactor
dfTables

Lines 6-8 define a separate function that is invoked farther down in the script. Line 10 specifies the URL string for the web page we want to scrape. Line 11 gathers the entirety of HTML from the page. And Line 13 creates a list of data frames. At line 17, NULL elements (if any) are removed from the list. At line 20, lapply is invoked. It iterates over the list of HTML table data and in turn invokes the TypeConvertDataFrameList function defined at lines 6-8. This function iterates over the variables of a data frame (much like columns of a database table) and attempts to cast them from a character (string) class to a different class, if appropriate. Without this, all of our scraped data would be returned as strings.

At line 22 we begin creating the single data frame that will be returned to SQL Server. A "url" variable (column) is created with one observation (row) for each HTML table found. Each value is the URL of the web page that was scraped. At line 25, a jsonStringsFactor vector is created with one element for each HTML table found. It has empty string values initially. In lines 29-31, the code iterates over each data frame element in dfHtmlTableData. The data from each data frame element is converted to JSON and assigned to the corresponding element in jsonStringsFactor. Line 35 creates a "HtmlTableAsJson" property (column) for our data frame, and assigns jsonStringsFactor to it. We now have a data frame with two variables (columns) and three observations (rows). Here is the output of the code (truncated for display purposes):

url                                     HtmlTableAsJson                                          
---------------------------------       ----------------------------------------------------------------------------------------------------------------------------------------------------------------
https://...celtics-2016-2017.html       [{"Jersey.Number":0,"Player":"Avery Bradley","Position":"SG","Height":"6-3","Weight":180,"Birthdate":"11/26/1990","Years.Experience":"6","College":"Texas"},...]
https://...celtics-2016-2017.html       [{"Game":1,"Date":"Wed, Oct 26, 2016","Time":"7:30p","At":"","Opponent":"Brooklyn Nets","Result":"W","OT":"","Points.For":122,"Points.Against":117,"Wins.Sea...]
https://...celtics-2016-2017.html       [{}]


SQL Server

With the necessary R code in hand, we can run it in-database with Machine Learning Services. I've wrapped the code within a stored procedure:

CREATE OR ALTER PROCEDURE dbo.ScrapeHtmlTables
 @Url NVARCHAR(MAX)
AS
BEGIN

DECLARE @RScript NVARCHAR(MAX);
SET @RScript = N'library(RCurl)
library(XML)
library(jsonlite)

#type.convert - sets appropriate class for each data frame variable/column.
TypeConvertDataFrameList <- function(lsDataFrame){
  lapply(lsDataFrame, type.convert, as.is = TRUE)
}

url <- "' + @Url + '"
html = RCurl::getURL(url)

dfHtmlTableData <- XML::readHTMLTable(html, header = TRUE,
                                 as.data.frame = TRUE,
                                 stringsAsFactors = FALSE)
#remove null elements from list.
dfHtmlTableData <- dfHtmlTableData[!sapply(dfHtmlTableData, is.null)]

#convert data frame variables to appropriate classes.
dfHtmlTableData <- lapply(dfHtmlTableData, TypeConvertDataFrameList)

dfTables <- data.frame(url = rep(c(url), each = length(dfHtmlTableData)))

#Empty vector for JSON string data.
jsonStringsFactor <- rep(c(""), each = length(dfHtmlTableData))

if (length(dfHtmlTableData) > 0) {
    #Convert each HTML table to JSON.
    for (i in 1:length(dfHtmlTableData)) {
        jsonStringsFactor[i] <- jsonlite::toJSON(
      as.data.frame(dfHtmlTableData[[i]]), auto_unbox = TRUE)
    }
}

dfTables$HtmlTableAsJson <- jsonStringsFactor
OutputDataSet <- dfTables

OutputDataSet';

EXEC sp_execute_external_script 
 @language = N'R',
 @script = @RScript
WITH RESULT SETS (
 (
  [Url] VARCHAR(2048),
  HtmlTableAsJson VARCHAR(MAX)
 )
);
END
GO

Now we can retrieve a web page's HTML table data by invoking a stored procedure and passing in a URL string. We don't need to know how many HTML tables there are, and we don't need to know their schemas:

CREATE TABLE #HtmlTableData (
 ID INT IDENTITY,
 [Url] VARCHAR(2048),
 HtmlTableAsJson VARCHAR(MAX)
)

INSERT INTO #HtmlTableData
EXEC dbo.ScrapeHtmlTables @Url = N'https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html';

SELECT *
FROM #HtmlTableData;

Dave Mason - RStats - sp_execute_external_script


JSON

From here on out, parsing the returned data is a JSON exercise. We would, of course, need to know the JSON schema to query it in a tabular or meaningful way. Or would we?

In another post, I described an automated way to infer JSON schema using dynamic T-SQL. Perhaps that article will pique your interest. Here's a look at what I had in mind.

DECLARE @Json VARCHAR(MAX);
SELECT @Json = d.HtmlTableAsJson
FROM #HtmlTableData d
WHERE d.ID = 2;

EXEC dbo.OpenJsonTabular @Json;
Dave Mason - JSON
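
If you already know which grid you want and what its keys are (they're visible in the JSON output above), plain OPENJSON works too. Here's a sketch that shreds the roster JSON (ID = 1 in the temp table) into columns; it assumes a database compatibility level of 130 or higher:

--Column names/types are inferred from the roster JSON shown earlier.
SELECT r.*
FROM #HtmlTableData d
CROSS APPLY OPENJSON(d.HtmlTableAsJson)
WITH (
 JerseyNumber INT '$."Jersey.Number"',
 PlayerName VARCHAR(128) '$.Player',
 Position VARCHAR(4) '$.Position',
 Height VARCHAR(8) '$.Height',
 Weight SMALLINT '$.Weight',
 Birthdate VARCHAR(16) '$.Birthdate',
 YearsExperience VARCHAR(4) '$."Years.Experience"',
 College VARCHAR(128) '$.College'
) AS r
WHERE d.ID = 1;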

Scraping Web Pages

Dave Mason RStats

I don't consider myself a data scientist. But that hasn't stopped me from finding some use cases for Machine Learning Services for SQL Server. One of them is acquiring data by scraping web pages. If you're a regular here, you probably know R is my language of preference. However, Python has packages for scraping web pages. I would imagine Java does too.

For this post, it might make more sense to skip ahead to the end result, and then work our way backwards. Here is a web page with some data: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players. We will scrape the web page, and write the data from the "roster" grid to a SQL Server table. It will look much like this:

Dave Mason sp_execute_external_script


R Code

R has an abundance of packages. This frequently provides numerous options to accomplish any task. For this one, I'll be using the RCurl and XML packages. This bit of code will do most of the work in R:

library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html"
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE,
                             as.data.frame = TRUE,
                             stringsAsFactors = FALSE);

Lines 1 and 2 load the necessary R packages. Line 3 specifies the URL string for the web page we want to scrape. Line 4 gathers the entirety of HTML from the page. And Line 5 creates a list of data frames. After we run the code, we can check the length of tables to see how many HTML tables were found. In this case, there are four:

> length(tables)
[1] 4
> 

Continuing on, I'll use the head() function to preview the first three rows of each data frame. This shows me the HTML table of "roster" data lies in the first data frame tables[[1]]:

> head(tables[[1]], n = 3L)
+ head(tables[[2]], n = 3L)
+ head(tables[[3]], n = 3L)
+ head(tables[[4]], n = 3L)

  Jersey Number        Player Position Height Weight  Birthdate Years Experience    College
1             0 Avery Bradley       SG    6-3    180 11/26/1990                6      Texas
2             7  Jaylen Brown       SF    6-6    223 10/24/1996                R California
3            99   Jae Crowder       SF    6-6    235 07/06/1990                4  Marquette

  Game              Date  Time At          Opponent Result OT Points For Points Against Wins-Season Losses-Season Streak
1    1 Wed, Oct 26, 2016 7:30p        Brooklyn Nets      W           122            117           1             0    W 1
2    2 Thu, Oct 27, 2016 8:00p  @     Chicago Bulls      L            99            105           1             1    L 1
3    3 Sat, Oct 29, 2016 7:00p  @ Charlotte Hornets      W           104             98           2             1    W 1

  CONNECT
1        

NULL
> 

With that, the R code that we will use in SQL Server via sp_execute_external_script will be as follows:

library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html"
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE,
                             as.data.frame = TRUE,
                             stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]


SQL Server

With the necessary R code in hand, let's run it in-database with Machine Learning Services. Since we know the schema, it is specified via WITH RESULT SETS:

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html";
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]'
WITH RESULT SETS (
 (
  JerseyNumber VARCHAR(4),
  PlayerName VARCHAR(128),
  Position VARCHAR(4),
  Height VARCHAR(8),
  Weight SMALLINT,
  BirthDate DATE,
  YearsExperience VARCHAR(2),
  College VARCHAR(128)
 )
);

Dave Mason RStats - RCurl XML

If desired, we can save the output to a temp table:

USE tempdb;
GO

DROP TABLE IF EXISTS #Roster;

CREATE TABLE #Roster (
 JerseyNumber VARCHAR(4),
 PlayerName VARCHAR(128),
 Position VARCHAR(4),
 Height VARCHAR(8),
 Weight SMALLINT,
 BirthDate DATE,
 YearsExperience VARCHAR(2),
 College VARCHAR(128)
)

INSERT INTO #Roster
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html";
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]'

We can go a step further and write the data to a permanent table (it's tempdb in this example, but you get the idea). Here I'll prepend two columns to indicate the season and team:

USE tempdb;
GO

SELECT CAST('2016-2017' AS VARCHAR(9)) AS Season, 
 CAST('BOS' AS VARCHAR(3)) AS Team, * 
INTO dbo.Roster
FROM #Roster;

Now we have the data as depicted in the first image at the top of this post:

SELECT * FROM dbo.Roster;
Dave Mason RStats - getURL readHTMLTable

As it turns out, there are more web pages of Boston Celtics roster data I can scrape (note the underlying URLs referenced in the code below).

With a couple of minor edits to the T-SQL code, I can grab that data too:
--Data for 2017-2018 season.
INSERT INTO #Roster
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2017-2018.html";
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]';

--Data for 2018-2019 season.
INSERT INTO #Roster
EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(RCurl)
library(XML)
url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2018-2019.html";
html = RCurl::getURL(url);
tables <- XML::readHTMLTable(html, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE);
OutputDataSet <- tables[[1]]';


Limitations of Schema

One of the drawbacks to the examples and techniques presented here is that we have to know the schema of the returned data before we can write it to a SQL Server table. If a web page has multiple HTML tables, we also have to figure out which ones have the data we intend to save. In my next post, I'll look at another option for web scraping that is not dependent on schema. In the meantime, there are a few caveats and gotchas to be aware of if you want to try web scraping in-database:

  1. If needed, the two packages can be installed as follows:
    • install.packages("RCurl")
    • install.packages("XML")
  2. By default, the SQL Server Setup disables outbound connections by creating firewall rules. Check out Firewall configuration for SQL Server Machine Learning Services.
  3. The getURL() function in the RCurl package doesn't work well (or at all) with JavaScript and dynamically generated HTML content. (Try scraping this page, for example.) I'm told there are other R packages that might work for those pages.
  4. Please take a long, hard look at The Web Scraping Best Practices Guide before you begin!


Data Obfuscation for SQL Server

Dave Mason - Data Obfuscation for SQL Server

In a previous post, I explored an option for generating fake data in SQL Server using Machine Learning Services and the R language. I've expanded on that by creating some stored procedures that can be used both for generating data sets of fake data and for obfuscating existing SQL Server data with fake data.

The code is available in a GitHub repository. For now, it consists of ten stored procedures. For data obfuscation, you'd use one or more of these five stored procs:

  • Obfuscator.SetAlphaNumericData
  • Obfuscator.SetEmailAddresses
  • Obfuscator.SetFullNames
  • Obfuscator.SetPhoneNumbers
  • Obfuscator.SetSSNs

Each procedure has input parameters for database name, table schema, table name, and column name(s). There is also an optional parameter to indicate whether triggers should be disabled and re-enabled. (Obfuscator.SetSSNs has an additional optional parameter to specify a formatting character.) Here are some examples from the [AdventureWorks] database, showing the data both before and after running each obfuscation proc:


Alphanumeric Data

EXEC MLtools.Obfuscator.SetAlphaNumericData
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'Production',
 @TableName = 'Product',
 @AlphaNumericColumn = 'ProductNumber',
 @DisableTriggers = 1;
Obfuscator.SetAlphaNumericData - Data Obfuscation for SQL Server

BeforeAfter
Obfuscator.SetAlphaNumericData - Data Obfuscation for SQL Server Obfuscator.SetAlphaNumericData - Data Obfuscation for SQL Server


Email Addresses

EXEC MLtools.Obfuscator.SetEmailAddresses
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'Person',
 @TableName = 'EmailAddress',
 @EmailAddressColumn = 'EmailAddress',
 @DisableTriggers = 1;
Obfuscator.SetEmailAddresses - Data Obfuscation for SQL Server

BeforeAfter
Obfuscator.SetEmailAddresses - Data Obfuscation for SQL Server Obfuscator.SetEmailAddresses - Data Obfuscation for SQL Server


Full Names

EXEC MLtools.Obfuscator.SetFullNames
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'Person',
 @TableName = 'Person',
 @FirstNameColumn = 'FirstName',
 @LastNameColumn = 'LastName',
 @MiddleNameColumn = 'MiddleName',
 @DisableTriggers = 1;
Obfuscator.SetFullNames - Data Obfuscation for SQL Server

BeforeAfter
Obfuscator.SetFullNames - Data Obfuscation for SQL Server Obfuscator.SetFullNames - Data Obfuscation for SQL Server


Phone Numbers

EXEC MLtools.Obfuscator.SetPhoneNumbers
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'Person',
 @TableName = 'PersonPhone',
 @PhoneNumColumn = 'PhoneNumber',
 @DisableTriggers = 1;
Obfuscator.SetPhoneNumbers - Data Obfuscation for SQL Server

BeforeAfter
Obfuscator.SetPhoneNumbers - Data Obfuscation for SQL Server Obfuscator.SetPhoneNumbers - Data Obfuscation for SQL Server


Social Security Numbers

EXEC MLtools.Obfuscator.SetSSNs
 @DatabaseName = 'Adventureworks',
 @TableSchema = 'HumanResources',
 @TableName = 'Employee',
 @SsnColumn = 'NationalIDNumber',
 @FormatChar = '-',
 @DisableTriggers = 1;
Obfuscator.SetSSNs - Data Obfuscation for SQL Server

BeforeAfter
Obfuscator.SetSSNs - Data Obfuscation for SQL Server Obfuscator.SetSSNs - Data Obfuscation for SQL Server


Design Notes

The "Obfuscator" stored procedures each invoke a "Generator" procedure. These returns result sets of fake data without performing any changes to existing data:

  • Generator.GetAlphaNumericData
  • Generator.GetEmailAddresses
  • Generator.GetFullNames
  • Generator.GetPhoneNumbers
  • Generator.GetSSNs
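
As a rough illustration of how a Generator procedure might be used on its own--the parameter name and output columns below are hypothetical placeholders, not the actual signatures from the repository--the result set could be captured like this:

--Hypothetical sketch only: @RowCount and the output columns are assumptions.
--Check the GitHub repository for the actual parameters and result sets.
CREATE TABLE #FakeNames (
 FirstName NVARCHAR(128),
 MiddleName NVARCHAR(128),
 LastName NVARCHAR(128)
);

INSERT INTO #FakeNames
EXEC MLtools.Generator.GetFullNames @RowCount = 100;

SELECT * FROM #FakeNames;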

The "Generator" and "Obfuscator" functionality were separated for a couple of reasons. The first, perhaps more obvious reason, is that each type of procedure has a different purpose. The other reason has to do with script language. My code use the R language and takes advantage of the "generator" package for R (among others). But perhaps R isn't your favorite. The procedures could easily be rewritten using Python (starting with SQL 2017) or Java (starting with SQL 2019). Those languages may offer other packages/libraries with better or different functionality for whatever your needs are. Or you may simply have more experience with (or a preference for) a different scripting language.

All of the stored procedures have a header and code comments--hopefully it is easy to understand and is not too arcane. Happy obfuscating!