A Faster PIVOT for SQL Server?

Dave Mason - SQL Server - PIVOT - OPENJSON

While working with some poorly performing code in T-SQL that used a PIVOT operator, I wondered if there was a more efficient way to get a result set of pivoted data. It may have been a fool's errand, but I still wanted to try. It dawned on me that I could use the STRING_AGG() function to build a delimited list of pivoted column names and values. From there, I'd have to "shred" the delimited data to rows and columns. But how?

SQL Server's longtime support for XML came to mind. But I quickly dismissed it--it's not my cup of tea and I've never thought of SQL Server combined with XML as a well-performing combination. What about using JSON? Could that work? And if so, would it perform well?

Below is the design pattern I settled on. There is a @Json variable that (as the name implies) holds all of the data from the result set, formatted as JSON. It gets passed to the OPENJSON() function, which returns a result set of typed columns based on what is specified in the WITH clause. The columns we *don't* want to pivot should be known at design time when the query is written--they're the ColName_1 through ColName_N placeholders in the pattern below. The columns and data we do want to pivot may or may not be known ahead of time--they're the PivColName_1 through PivColName_N placeholders. In my experience, they usually aren't known, and dynamic T-SQL must be used. I'll assume that's the case and use dynamic T-SQL for an example.


The Design Pattern


DECLARE @Json VARCHAR(MAX) = 
'[
{ "ColName_1":"Val_1","ColName_2":"Val_2", ... "ColName_N":"Val_N", "PivColName_1":"PVal_1","PivColName_2":"PVal_2", ... "PivColName_N":"PVal_N"},
{ "ColName_1":"Val_1","ColName_2":"Val_2", ... "ColName_N":"Val_N", "PivColName_1":"PVal_1","PivColName_2":"PVal_2", ... "PivColName_N":"PVal_N"},
{ "ColName_1":"Val_1","ColName_2":"Val_2", ... "ColName_N":"Val_N", "PivColName_1":"PVal_1","PivColName_2":"PVal_2", ... "PivColName_N":"PVal_N"},

{ "ColName_1":"Val_1","ColName_2":"Val_2", ... "ColName_N":"Val_N", "PivColName_1":"PVal_1","PivColName_2":"PVal_2", ... "PivColName_N":"PVal_N"}
]';

SELECT *
FROM OPENJSON(@Json)
WITH (
	ColName_1 <datatype>, ColName_2 <datatype>, ... ColName_N <datatype>, PivColName_1 <datatype>, PivColName_2 <datatype>, ... PivColName_N <datatype>
)
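
As a tiny, self-contained illustration of the pattern (with made-up column names and values), the following runs as-is and returns two rows--one fixed column plus two pivoted columns:

DECLARE @Json VARCHAR(MAX) = 
'[
{ "InvoiceID":1, "10":5, "20":3 },
{ "InvoiceID":2, "10":7, "20":1 }
]';

SELECT *
FROM OPENJSON(@Json)
WITH (
	InvoiceID INT,
	[10] INT '$."10"',
	[20] INT '$."20"'
);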


Sample Code

Here are two queries you can try for yourself with the [WideWorldImporters] database.

USE WideWorldImporters;

DECLARE @SQL AS VARCHAR(MAX);
DECLARE @WithCols VARCHAR(MAX);
DECLARE @Json VARCHAR(MAX);

;WITH ColNames AS
(
	SELECT DISTINCT AccountsPersonID
	FROM Sales.Invoices
)
SELECT 
	--Pivoted column names that will be specified in the OPENJSON WITH clause.
	@WithCols = '[InvoiceID] INT ''$.InvoiceID'','  + 
		STRING_AGG('[' + CAST(AccountsPersonID AS VARCHAR(MAX)) + '] INT ''$."' + CAST(AccountsPersonID AS VARCHAR(MAX)) + '"''', ', ') 
		WITHIN GROUP ( ORDER BY AccountsPersonID )
FROM ColNames;

SELECT @Json = '[' + STRING_AGG(CAST(t.JsonData AS VARCHAR(MAX)), ',') + ']'
FROM
(
	SELECT 
		N'{' + ' "InvoiceID":' + CAST(s.InvoiceID AS VARCHAR(MAX)) + 
			COALESCE(',' + STRING_AGG('"' + CAST(s.AccountsPersonID AS VARCHAR(MAX)) + '":' + CAST(s.Quantity AS VARCHAR(MAX)), ','), '') + 
		'}' 
		AS JsonData
	--Derived table with aggregated data.
	FROM (
		SELECT i.InvoiceID, i.AccountsPersonID, SUM(l.Quantity) AS Quantity
		FROM Sales.Invoices i
		JOIN Sales.InvoiceLines l
			ON l.InvoiceID = i.InvoiceID
		GROUP BY i.InvoiceID, AccountsPersonID
	) AS s
	GROUP BY s.InvoiceID
) t

--Parse the JSON data and return a result set.
SET @SQL = '
SELECT *
FROM OPENJSON(''' + @Json + ''')
	WITH
	(
		' + @WithCols + '
	)
';
EXEC(@SQL);
GO

USE WideWorldImporters;
DECLARE @Cols VARCHAR(MAX);

--Pivoted column names that will be specified in the PIVOT operator.
;WITH ColNames AS
(
	SELECT DISTINCT AccountsPersonID
	FROM Sales.Invoices
)
SELECT @Cols = STRING_AGG(QUOTENAME(AccountsPersonID), ',') WITHIN GROUP (ORDER BY AccountsPersonID)
FROM ColNames;

DECLARE @SQL VARCHAR(MAX) = '
SELECT *
FROM   
(
	SELECT l.InvoiceID, i.AccountsPersonID, l.Quantity
	FROM Sales.Invoices i
	JOIN Sales.InvoiceLines l
		ON l.InvoiceID = i.InvoiceID
) p  
PIVOT  
(  
	SUM(Quantity)  
	FOR AccountsPersonID IN  
	( ' + @Cols + '
	)  
) AS pvt';

EXEC(@SQL);
GO

Results and Comparison

Both query execution plans include a nonclustered columnstore index scan on [Sales].[InvoiceLines] and a nonclustered rowstore index scan on [Sales].[Invoices]. And they each have the same number of reads. If I return the results to the SSMS grid, PIVOT takes longer--both in terms of duration and CPU.
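
If you'd like to capture duration, CPU, and reads yourself, one simple option (not necessarily how the numbers shown here were gathered) is to enable time and I/O statistics before running each version:

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

--Run the OPENJSON version, then the PIVOT version, and compare the
--elapsed time, CPU time, and logical reads in the Messages tab.

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;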

Dave Mason - OPENJSON - STRING_AGG

Perhaps a better comparison is to discard grid results after execution. With that setting specified in SSMS options, the execution plans remain the same. But duration and CPU are down for both plans. Here again, the plan using PIVOT takes longer.

Dave Mason - OPENJSON - STRING_AGG

Conclusions

So did I beat PIVOT? Definitely maybe. At least for this one contrived example. I do have another example that is running in a test environment with one of my clients, and it beats PIVOT too. It has a result set of about 200 columns and 44 thousand rows. I'm not ready to declare victory, but it is intriguing. I'd like to test many, many more queries. The slight gains demonstrated here (even if consistently reproducible) are arguably not worth the hassle. I'll let you be the judge.


Analytic Function FIRST_VALUE

Dave Mason SQL Server FIRST_VALUE

Last week, I found myself with a T-SQL problem. A customer has a database where multiple "widget" records might exist in a table. End users might enter the duplicate rows, or they might be inserted by other automated processes. My task was to keep the most recent widget record. But if the most recent record had any NULL values, I had to look back through the earlier "duplicate" rows and use the most recent non-NULL value for each column.

Here's a contrived data example. I needed to take the following result set...

Row_ID  Tomash_ID  Hoppler_ID  Atrovel_ID  Mogotrevo_ID  Buwheal_ID  Izzillent_ID
1       NULL       2           3           14            55          16
2       21         22          3           4             5           NULL
3       12         NULL        3           44            NULL        6
4       11         22          NULL        4             NULL        66
5       11         2           3           14            NULL        NULL
6       1          2           3           NULL          NULL        NULL

...and reduce it to this:

Row_ID  Tomash_ID  Hoppler_ID  Atrovel_ID  Mogotrevo_ID  Buwheal_ID  Izzillent_ID
6       1          2           3           14            5           66

There was an existing stored procedure that resolved the duplicates. But it had quite a bit of procedural code. There were separate queries that looked up the most recent value for each column, followed by an update to the "good" row. I was looking for a set-based approach. I made various attempts. Some were close, others were ridiculous, all were unsuccessful. Then I stumbled upon the FIRST_VALUE analytic function for T-SQL.


FIRST_VALUE

As its name indicates, FIRST_VALUE returns the first value in an ordered set of values. It requires an OVER clause along with an ORDER BY clause. There are optional PARTITION BY and ROWS/RANGE clauses -- check out the documentation for all the details.

Below is the query I used to get the desired result (along with some sample data so you can try yourself). The CASE expression is used to avoid returning NULL values--it orders them last. (There are IGNORE NULLS | RESPECT NULLS parameters that would seem to work well here, but the documentation says those only apply to Azure SQL Edge.)


SELECT TOP(1)
	Row_ID,
	FIRST_VALUE(Tomash_ID) OVER (ORDER BY CASE WHEN Tomash_ID IS NULL THEN 1 ELSE 0 END, Row_ID DESC) AS Tomash_ID_MostRecent,
	FIRST_VALUE(Hoppler_ID) OVER (ORDER BY CASE WHEN Hoppler_ID IS NULL THEN 1 ELSE 0 END, Row_ID DESC) AS Hoppler_ID_MostRecent,
	FIRST_VALUE(Atrovel_ID) OVER (ORDER BY CASE WHEN Atrovel_ID IS NULL THEN 1 ELSE 0 END, Row_ID DESC) AS Atrovel_ID_MostRecent,
	FIRST_VALUE(Mogotrevo_ID) OVER (ORDER BY CASE WHEN Mogotrevo_ID IS NULL THEN 1 ELSE 0 END, Row_ID DESC) AS Mogotrevo_ID_MostRecent,
	FIRST_VALUE(Buwheal_ID) OVER (ORDER BY CASE WHEN Buwheal_ID IS NULL THEN 1 ELSE 0 END, Row_ID DESC) AS Buwheal_ID_MostRecent,
	FIRST_VALUE(Izzillent_ID) OVER (ORDER BY CASE WHEN Izzillent_ID IS NULL THEN 1 ELSE 0 END, Row_ID DESC) AS Izzillent_ID_MostRecent
FROM #Test
ORDER BY Row_ID DESC;

DROP TABLE IF EXISTS #Test;

CREATE TABLE #Test (
	Row_ID INT IDENTITY NOT NULL,
	Tomash_ID INT, 
	Hoppler_ID INT, 
	Atrovel_ID INT, 
	Mogotrevo_ID INT, 
	Buwheal_ID INT, 
	Izzillent_ID INT
);

INSERT INTO #Test VALUES(NULL,2,3,14,55,16);
INSERT INTO #Test VALUES(21,22,3,4,5,NULL);
INSERT INTO #Test VALUES(12,NULL,3,44,NULL,6);
INSERT INTO #Test VALUES(11,22,NULL,4,NULL,66);
INSERT INTO #Test VALUES(11,2,3,14,NULL,NULL);
INSERT INTO #Test VALUES(1,2,3,NULL,NULL,NULL);

The presence of a FIRST_VALUE function might lead you to believe there is a LAST_VALUE function. And there is! It returns the last value in an ordered set of values, but is otherwise the same in syntax and usage as FIRST_VALUE.
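
Here's a minimal sketch against the same #Test table, showing LAST_VALUE returning the same "most recent non-NULL" result for one column. Note the explicit ROWS clause: without it, the default window frame ends at the current row, and LAST_VALUE would simply return the current row's value.

SELECT TOP(1)
	Row_ID,
	LAST_VALUE(Tomash_ID) OVER (
		ORDER BY CASE WHEN Tomash_ID IS NULL THEN 1 ELSE 0 END DESC, Row_ID
		ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
	) AS Tomash_ID_MostRecent
FROM #Test
ORDER BY Row_ID DESC;

Happy coding, everyone.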


NBA Games Are Harder Than Ever To Watch

The 2020-2021 NBA season hasn't been as fulfilling as I had hoped. It's not just because my beloved Boston Celtics aren't doing as well as hoped. It's because of the in-game camera work. I noticed things started getting weird last season during the playoffs. And it's gotten weirder and weirder since then.


Traditional Camera Angle

Here's an example of the classic camera angle that has always worked well. You can generally see all five players on each team, frequently from head to toe. The entirety of the half-court is visible. Even Grant Williams, waiting patiently in the lower left for an open "corner three", is fully visible. This works. It is perfection.


Low Camera Angles

I think ABC and/or ESPN is to blame for this. With the low camera angle that seems to be en vogue, there is still some elevation--the camera isn't ground-level. But it's significantly lower than the camera height of the traditional angle. Players that are far apart can seem bunched up. Three players for the Wizards (labeled 1, 2, 3) look like a pile of bodies. Note that two of them are on opposite sides of the lane, which is 16 feet wide. Aaron Nesmith for the Celtics (labeled 4) might be mistaken for a bench player standing up from the bench.

NBA Camera Angles - Too Low

Here's another example. Again we see multiple players standing significantly apart (players 2 and 3 for Washington). One of them is badly obscured by the other. Making matters worse, the fifth Celtic can't even be seen. What you can't tell from this camera angle is that Carsen Edwards is standing behind the three-point line near the baseline. This looks like five on four. It is inexcusable.

NBA Camera Angles - Too Low

The low camera angle stinks in the half court. And it stinks when teams are in transition too. Here you can only see two players from each team.

NBA Camera Angles - Too Low

Other Strange Angles

It's not just the low camera angle. There are other questionable choices with the camera work. Here's what was shown when Grant Williams inbounded the ball from the sideline in the frontcourt. You can only see part of the court, several players are off camera, and you can't see the basket at all.

NBA Camera Angles - Inbound

This is essentially the same thing, but with a birds-eye view. You still can't see all of the players or the basket.

NBA Camera Angles - Inbound

Who is Semi Ojeleye inbounding the ball to?

NBA Camera Angles - Inbound

Here we get an odd camera angle for a jump ball. That's 7' 6" Tacko Fall (one of the 40 tallest people in the world) for Boston and 6' 6" Troy Brown Jr. for Washington. They look the same size.

NBA Camera Angles - Jump Ball

This is another bad angle. Pop-quiz: who is shooting this free throw?

NBA Camera Angles - Free Throw

There's an abundance of bad camera angles for free throws. I counted around ten in the game where these images are from. It's too much.


Closing Thoughts

If I had the ear of those responsible, I'd want them to know they're trying too hard. I'd ask them to take to heart the old adage KISS: keep it simple, stupid. Stick with the traditional camera angles.


T-SQL Tuesday #135: Outstanding Tools

Dave Mason T-SQL Tuesday

For T-SQL Tuesday #135, Mikey Bronowski (b|t) asks us about the tools we use. What are the most helpful and effective tools you use or know of? Great questions, Mikey! I've got a few in mind for sure. And although I usually blog about SQL Server or something data-related, these are more general-purpose tools...at least for us Windows folks.


Notepad

Yep, Notepad. I use it for all sorts of things. Whether it's code files, configuration/settings files, files for third-party apps, or just plain old text/log files, Notepad is my go-to plain text viewer and editor. Want to level up your Notepad skills? Create a shortcut to the Notepad.exe file and drop it in the "SendTo" folder. Now you can open any file with Notepad easily from Windows Explorer. Neat!


Paint.NET

I'm a proponent of "Show me, don't tell me". Screen captures go a long way toward that. Sure, Windows has Paint, but Paint.NET is a bit more advanced. I almost always have an instance of it open in the background. Hit the Print Screen keyboard button and paste (CTRL + V) it into Paint.NET (you can also use the ALT + Print Screen key combination to screen capture just the active window). From there you can do anything: trim down to just a specific part of the image, add some red arrows or circles for emphasis, blur/obscure any sensitive data that's in the image, etc. I take tweaked screen shots and paste them into just about anything...Word documents, email, even Twitter.


SpaceSniffer

Running low on disk space and having a hard time figuring out where all your free storage went? There are numerous utilities that can help out. My favorite is SpaceSniffer. There's a zipped download that doesn't require a typical setup.exe installation. This makes it very convenient to copy to other machines you may be responsible for.


KeePass

If you're not using a password manager, I strongly suggest it. There are many good ones out there with different features and pros & cons. I've settled on KeePass. But if it's not KeePass, please use something!


7-Zip

Windows' built-in zip/compression functionality does what I need most of the time. If I need to password protect a zipped file, though, 7-Zip is the tool I use. It also has many additional features above and beyond what Windows provides (archive format, compression level, split to volumes, etc.).


Remote Desktop Connection Manager

If you find yourself connecting to a large number of virtual machines, managing those connections can be a chore. Before using Remote Desktop Connection Manager (aka "RDCMan"), I used to open the built-in Windows Remote Desktop Connection tool, hope the computer I wanted to connect to was in the drop-down list of computers, and manually enter my credentials. I got fancy and started using a DOS batch file for each computer I connected to regularly. But I still had to manually enter credentials. Enter Remote Desktop Connection Manager: it lets you create and organize groups of remote desktop connections. It handles credentials and a few other things too. Unfortunately, a vulnerability in the tool was discovered and Microsoft decided to retire it. If you have a different one to suggest, please drop me a note in the comments!


Headphones

The pandemic has resulted in many of us working from home. I'm sure we've all struggled during online meetings with poor audio and feedback that makes it hard for everyone to hear each other. Headphones are immensely better than using the built-in speakers and microphone of a laptop or desktop computer. This headset that I've been using for a few years is a Logitech model. It was cheap too--only about $30 (US). The microphone arm can be adjusted in different positions, and you can even flip it up and entirely out of the way if you just want to use the headset to listen to music. There's a physical mute button that is very handy, along with volume controls. It's USB too--I didn't have to install anything on Windows. Audiophiles may turn their noses up at this model. You can spend more (a lot more!) on a better microphone/headset setup that sounds ostensibly better. But compared to everyone else without a headset, you'll be playing chess while they're playing checkers.


Microsoft Certifications: an Evolving Stance

Dave Mason Microsoft Certification

I've been a longtime proponent of certifications. This is not a majority position--there are many who feel certs are worthless pieces of paper. But in my experience, the process of preparing for an exam has been an unmatched learning experience. Still, I haven't earned a ton of certifications. And sometimes I go for long stretches without pursuing one. In fact, it's been more than two years since my last cert. Why have I waited so long? In a word, Azure.

My preferred method to prep for a certification is a Microsoft Press book. They create the test, so it makes sense to use their study guide. Microsoft's new role-based certifications are Azure-centric. For those, there's not much out there. One of the explanations I had heard was that the Azure ecosystem changes so rapidly that by the time a book was published, a significant amount of information would be outdated. So Microsoft isn't in that business much any more (if at all). This makes sense. I've noted similar observations from numerous speakers in the SQL Server community--the slide decks and other content they meticulously pore over and compile quickly become outdated, forcing them to frequently update their presentations.

In turn, I think it's fair to say the current role-based Azure certification exams are moving targets. This poses two problems. One, how do I know I'm studying (and learning!) the right material? And two, how long after earning a cert will the material I learned be irrelevant?

I've heard of some people that can prep for a test in a few weeks or less. I am not one of those people. It usually takes me three to four months to prepare. I don't think that cadence works any more. As for irrelevancy, it's been exhausting trying to convince much of the IT world that certifications have value. A quickly outdated Azure cert makes it even harder to push that boulder up a hill.


Unpopular Opinion

If you have pearls nearby, now would be a good time to clutch them: I don't like Azure. Oh, hey...you're still here. Thanks for sticking around! The thing is, I don't dislike it, either. I just don't use Azure for much of anything. From what I've gathered, the lion's share of work is still happening on premises. But you wouldn't know it looking at Microsoft's latest certification offerings.

It's not just SQL Server either. Windows certifications are gone too. C# and .NET certifications for developers are suffering a similar fate. Microsoft's focus is clearly in the cloud. And that's ok. But what is to become of their flagship relational database management system, operating system, development platform, and programming language? In the absence of information from Redmond, all we can do is speculate. Without knowing anything about the path forward, my thoughts have turned gloomy.


Counterpoint

My hesitancy to go all in with Azure has led me to ask myself a few questions. Am I being stubborn? Am I being lazy? Am I resistant to change? I don't rule any of those out. Those questions warrant a solid "maybe". I've got fewer years in my career ahead of me than behind, though. I need to be sensible about where I invest my time and energy. I'd also point out I haven't stopped learning, having recently finished two EdX courses (Data Science: R Basics and Data Science: Visualization).

Thinking back to my first certification experience (for Oracle), it didn't benefit me directly. I don't have regrets--I learned a lot about database theory, development, and infrastructure. But if I had it to do over again, I'd have chosen a different path. I'll be watching the Microsoft cert landscape, hoping there will be new offerings for the on premises crowd. Until then, my instincts tell me there are better paths for me than Azure.


Adding Custom Transact-SQL Code Snippets

Dave Mason - SQL Server Management Studio

I have a number of scripts and queries I've written and curated over the years. They're not organized that well, scattered across different folders. Some are named poorly or grouped in a questionable manner. There are a handful that I tend to use the most. And yet with that small number, I sometimes have difficulty quickly finding a particular script (if I can find it at all), or spend too many mouse clicks getting to it. It dawned on me recently to make use of code snippets.

Code snippets may have been intended primarily to aid code writing, but they can assist with administrative tasks too. Regardless of purpose, SSMS supports creating your own snippets, and we can create our first snippet in three steps:

  1. Copy an existing code snippet file.
  2. Edit it.
  3. Import it into SSMS.

Step 1

To get started, open the Code Snippets Manager from within SSMS via the Tools menu.

Dave Mason - SSMS Code Snippets Manager

Note the location where the built-in "Function" snippets are stored, and then navigate to that path in Windows Explorer (you can highlight the path with your mouse and copy it, if desired).

Dave Mason - Code Snippets Manager

Find the "Begin End.snippet" file, and copy it. Create a "My Snippets" folder in your "My Documents" folder, paste the file into it, and rename the file to "Last Restart.snippet".


Step 2

Next, open the file in a text editor. I use Visual Studio 2019, but any editor will do. You'll probably notice the snippet file consists of XML data. Here's what it looks like:

<?xml version="1.0" encoding="utf-8" ?>
<CodeSnippets  xmlns="http://schemas.microsoft.com/VisualStudio/2005/CodeSnippet">
  <_locDefinition xmlns="urn:locstudio">
    <_locDefault _loc="locNone" />
    <_locTag _loc="locData">Title</_locTag>
    <_locTag _loc="locData">Description</_locTag>
    <_locTag _loc="locData">Author</_locTag>
    <_locTag _loc="locData">ToolTip</_locTag>
  </_locDefinition>
  <CodeSnippet Format="1.0.0">
    <Header>
      <Title>Begin</Title>
      <Shortcut></Shortcut>
      <Description>Code Snippet for Begin...End block.</Description>
      <Author>Microsoft Corporation</Author>
      <SnippetTypes>
        <SnippetType>SurroundsWith</SnippetType>
      </SnippetTypes>
    </Header>
    <Snippet>
      <Declarations></Declarations>
      <Code Language="SQL">
        <![CDATA[
BEGIN

$selected$ $end$

END
]]>
      </Code>
    </Snippet>
  </CodeSnippet>
</CodeSnippets>

Now we'll edit the file to create a "Last Restart" code snippet that inserts this query to an SSMS code tab/window:

SELECT i.sqlserver_start_time
FROM sys.dm_os_sys_info i;

The values for the Title, Description, and Author nodes (highlighted in yellow) become "Last Restart", "Query that returns the date and time SQL was last started.", and "Dave Mason @BeginTry" respectively. The SnippetTypes node (highlighted in orange) is not needed--delete it. The code for our snippet query (also highlighted in yellow) goes in the Code node, within the CDATA section. Here is the end result:

<?xml version="1.0" encoding="utf-8" ?>
<CodeSnippets  xmlns="http://schemas.microsoft.com/VisualStudio/2005/CodeSnippet">
  <_locDefinition xmlns="urn:locstudio">
    <_locDefault _loc="locNone" />
    <_locTag _loc="locData">Title</_locTag>
    <_locTag _loc="locData">Description</_locTag>
    <_locTag _loc="locData">Author</_locTag>
    <_locTag _loc="locData">ToolTip</_locTag>
  </_locDefinition>
  <CodeSnippet Format="1.0.0">
    <Header>
      <Title>Last Restart</Title>
      <Shortcut></Shortcut>
      <Description>Query that returns the date and time SQL was last started.</Description>
      <Author>Dave Mason @BeginTry</Author>
    </Header>
    <Snippet>
      <Code Language="SQL">
        <![CDATA[SELECT i.sqlserver_start_time
FROM sys.dm_os_sys_info i;]]>
      </Code>
    </Snippet>
  </CodeSnippet>
</CodeSnippets>


Step 3

Lastly, we'll import the code snippet into SSMS, so return to Code Snippets Manager. I had some trouble importing *.snippet files into the default "My Code Snippets" category/folder (perhaps a folder permission issue?). I removed "My Code Snippets", clicked the "Add..." button, and navigated to and selected the "My Snippets" folder I had previously created within "My Documents". Expanding "My Snippets" reveals the new snippet that was created.

Dave Mason - Code Snippets Manager

Using The New Snippet

Click OK to close Code Snippets Manager (if it's still open) and open a new SSMS code tab/window.

  1. Right-click anywhere and select "Insert Snippet...".
  2. A drop-down list of snippet categories appears. Double-click the "My Snippets" category folder.
  3. The "Last Restart" code snippet will be available (note the Tooltip). Double-click "Last Restart".
  4. The query from our snippet appears in the SSMS window.

Dave Mason - SSMS Insert Snippet
Dave Mason - SSMS Snippet Categories
Dave Mason - Last SQL Server Restart
Dave Mason - Last SQL Server Restart

Additional Snippets

If you want to add more snippets to SSMS, create a new *.snippet file, edit it, and copy it into the existing folder with "Last Restart.snippet". If the new snippet adheres to the Code Snippets Schema, it will show up automatically in SSMS.

I've created some other snippets that are available in this Github repository: BeginTry/SSMS-Code-Snippets. Many of them include replacement points, which will be of interest. Happy coding, everyone.


Transact-SQL Code Snippets

There are certain T-SQL statements whose syntax I have trouble remembering. When those situations arise, I might look up the syntax online; find the same type of object in SSMS, right-click it, script out the object, and use that as a starting point; or find one of my own scripts that has the syntax I'm looking for. Another option that I often overlook is T-SQL code snippets.

What is a code snippet? Here's the definition from the Microsoft documentation:

A Transact-SQL code snippet is a template containing the basic structure of a Transact-SQL statement or block. You can use snippets as a starting point when adding statements in the Database Engine Query Editor. You can insert the pre-defined snippets supplied with SQL Server, or create your own.

Here's what it looks like in SSMS:

  1. Right-click in an SSMS window and click "Insert Snippet..."
  2. A drop-down list of snippet categories appears. Double-click one of the category folders to see the snippets in that group.
  3. A drop-down list of code snippets appears. Hover your mouse over any of them to see the snippet definition tooltip. Double-click one to insert the code to the SSMS window.
  4. If the snippet has replacement points (highlighted in yellow), the TAB key will cycle through them allowing you to conveniently replace each with the values you want. If you rest your mouse pointer on a replacement point, a tooltip appears with a description.

Dave Mason - SSMS - Insert Snippet
Dave Mason - Snippet Categories
Dave Mason - Code Snippets
Dave Mason - Code Snippet Replacement Points

SSMS has pre-defined snippets for the following categories:

  1. Function
  2. Index
  3. Login
  4. Role
  5. Schema
  6. Stored Procedure
  7. Synonym
  8. Table
  9. Trigger
  10. User
  11. User Defined Data Type
  12. User Defined Table Type
  13. User Defined Type
  14. View

Check 'em out. And happy coding!

(Would you like to create your own, custom code snippets? Continue reading...)


Columnstore For Real-Time Operational Analytics

SQL Server Columnstore - Dave Mason

Starting with SQL Server 2016, new columnstore indexing features allow something called real-time operational analytics. Microsoft's documentation describes it as "the ability to run both analytics and OLTP workloads on the same database tables at the same time". It can also "eliminate the need for ETL and a data warehouse".

I tend to work mostly with OLTP environments. Many of them have questionable designs or serve reporting workloads. Not surprisingly, there are a lot of performance-sapping table scans and index scans. I've compensated for this somewhat by using row and page compression, which became available on all editions of SQL Server starting with SQL Server 2016 SP1. Could I get even better results with columnstore indexes? Let's look at one example.

Here are four individual query statements from a stored procedure used to get data for a dashboard. If you add up percentages for Estimated Cost (CPU + IO), Estimated CPU Cost, or Estimated IO Cost, you get a total of about 90% (give or take a few percent).

SQL Server Columnstore - Dave Mason

The first two statements include a nonclustered index scan and the second two both have a clustered index scan. Each of the four scans consumes a large portion of the CPU and/or IO. Two nonclustered columnstore indexes were added. Now let's take a look at the before and after images of the query plans.
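
For reference, the DDL for a nonclustered columnstore index looks roughly like this (the table, column, and index names below are hypothetical, not the actual ones from this environment):

--Hypothetical example: a nonclustered columnstore index covering the
--columns the analytical queries scan and aggregate.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SalesDetail
ON dbo.SalesDetail (SaleDate, CustomerID, ProductID, Quantity, Amount);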


Statement 1

Index scan and index seek operators are both replaced by columnstore index scans. Note the table insert operation--presumably, its overall cost would be consistent. Yet as a percentage, it goes up significantly, becoming more "expensive" by virtue of the other operations being more efficient and less "expensive".

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


Statement 2

As in the previous statement, index scan and index seek operators are both replaced by columnstore index scans. We also see the table insert operation increase in cost.

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


Statements 3 and 4

The final two statements are nearly identical, so I won't repeat them. It's a different table this time, and it has a clustered index seek and a clustered index scan. Both operations are replaced by columnstore index scans.

SQL Server Rowstore - Dave Mason

SQL Server Columnstore - Dave Mason


With the nonclustered columnstore indexes in place, the relative costs of the four individual query statements go down significantly. Summed up, Estimated Cost (CPU + IO) is down to 58%, Estimated CPU Cost is down to 66%, and Estimated IO Cost is down to 55%.

SQL Server Columnstore - Dave Mason


The execution stats look even better. Elapsed time for the stored procedure drops from 15 seconds to 5 seconds. Logical reads also dropped by 80%. And there's even more good news. By keeping a close eye on index usage stats, I noticed some of the other rowstore indexes on the two tables were no longer being used for index scans. Various other executions plans were now using the new nonclustered columnstore index for those scans instead. Many of those plans saw similar performance gains with reduced CPU and IO.
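
A query along these lines can help keep an eye on index usage stats (the DMV and its columns are standard; the table name is just a placeholder):

--Seeks, scans, and lookups per index for a given table.
SELECT i.name AS IndexName, s.user_seeks, s.user_scans, s.user_lookups, s.last_user_scan
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
	ON s.[object_id] = i.[object_id]
	AND s.index_id = i.index_id
	AND s.database_id = DB_ID()
WHERE i.[object_id] = OBJECT_ID('dbo.SalesDetail');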

I am still somewhat new to columnstore indexes, and my sample size is admittedly small. But so far, real-time operational analytics appears to be the best of both worlds. Reporting and analytical queries that scan wide swaths of data are working well with columnstore indexes, while existing rowstore indexes provide traditionally efficient seeks. Are you using nonclustered columnstore indexes on OLTP tables too? Please leave a comment! I haven't found anyone else that's doing this yet. Happy indexing, everyone.


Scraping Web Pages - Part 2

Dave Mason - RCurl XML

In the last post, we looked at a way to scrape HTML table data from web pages, and save the data to a table in SQL Server. One of the drawbacks is the need to know the schema of the data that gets scraped--you need a SQL Server table to store the data, after all. Another shortcoming is if there are multiple HTML tables, you need to identify which one(s) you want to save.

For this post, we'll revisit web scraping with Machine Learning Services and R. This time, we'll take a schema-less approach that returns JSON data. As before, this web page will be scraped: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players, the second is a listing of games played during the regular season.


R Code

This code uses the RCurl and XML packages (as was the case in the last post), plus the jsonlite package. There's a bit more code this time. Hopefully the code comments are helpful:

library(RCurl)
library(XML)
library(jsonlite)

#type.convert - sets appropriate class for each data frame variable/column.
TypeConvertDataFrameList <- function(lsDataFrame) {
    lapply(lsDataFrame, type.convert, as.is = TRUE)
}

url <- "https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html"
html = RCurl::getURL(url)

dfHtmlTableData <- XML::readHTMLTable(html, header = TRUE,
                                 as.data.frame = TRUE,
                                 stringsAsFactors = FALSE)
#remove null elements from list.
dfHtmlTableData <- dfHtmlTableData[!sapply(dfHtmlTableData, is.null)]

#convert data frame variables to appropriate classes.
dfHtmlTableData <- lapply(dfHtmlTableData, TypeConvertDataFrameList)

dfTables <- data.frame(url = rep(c(url), each = length(dfHtmlTableData)))

#Empty vector for JSON string data.
jsonStringsFactor <- rep(c(""), each = length(dfHtmlTableData))

if (length(dfHtmlTableData) > 0) {
    #Convert each HTML table to JSON.
    for (i in 1:length(dfHtmlTableData)) {
        jsonStringsFactor[i] <- jsonlite::toJSON(
      as.data.frame(dfHtmlTableData[[i]]), auto_unbox = TRUE)
    }
}

dfTables$HtmlTableAsJson <- jsonStringsFactor
dfTables

Lines 6-8 define a separate function that is invoked farther down in the script. Line 10 specifies the URL string for the web page we want to scrape. Line 11 gathers the entirety of HTML from the page. And line 13 creates a list of data frames. At line 17, NULL elements (if any) are removed from the list. At line 20, lapply is invoked. It iterates over the list of HTML table data and in turn invokes the TypeConvertDataFrameList function defined at lines 6-8. This function iterates over the variables of a data frame (much like columns of a database table) and attempts to cast them from a character (string) class to a different class, if appropriate. Without this, all of our scraped data would be returned as strings.

At line 22 we begin creating the single data frame that will be returned to SQL Server. A "url" variable (column) is created with one observation (row) for each HTML table found. Each value is the URL of the web page that was scraped. At line 25, a jsonStringsFactor vector is created with one element for each HTML table found. It has empty string values initially. In lines 29-31, the code iterates over each data frame element in dfHtmlTableData. The data from each data frame element is converted to JSON and assigned to the corresponding element in jsonStringsFactor. Line 35 creates a "HtmlTableAsJson" property (column) for our data frame, and assigns jsonStringsFactor to it. We now have a data frame with two variables (columns) and three observations (rows). Here is the output of the code (truncated for display purposes):

url                                     HtmlTableAsJson                                          
---------------------------------       ----------------------------------------------------------------------------------------------------------------------------------------------------------------
https://...celtics-2016-2017.html       [{"Jersey.Number":0,"Player":"Avery Bradley","Position":"SG","Height":"6-3","Weight":180,"Birthdate":"11/26/1990","Years.Experience":"6","College":"Texas"},...]
https://...celtics-2016-2017.html       [{"Game":1,"Date":"Wed, Oct 26, 2016","Time":"7:30p","At":"","Opponent":"Brooklyn Nets","Result":"W","OT":"","Points.For":122,"Points.Against":117,"Wins.Sea...]
https://...celtics-2016-2017.html       [{}]


SQL Server

With the necessary R code in hand, we can run it in-database with Machine Learning Services. I've wrapped the code within a stored procedure:

CREATE OR ALTER PROCEDURE dbo.ScrapeHtmlTables
 @Url NVARCHAR(MAX)
AS
BEGIN

DECLARE @RScript NVARCHAR(MAX);
SET @RScript = N'library(RCurl)
library(XML)
library(jsonlite)

#type.convert - sets appropriate class for each data frame variable/column.
TypeConvertDataFrameList <- function(lsDataFrame){
  lapply(lsDataFrame, type.convert, as.is = TRUE)
}

url <- "' + @Url + '"
html = RCurl::getURL(url)

dfHtmlTableData <- XML::readHTMLTable(html, header = TRUE,
                                 as.data.frame = TRUE,
                                 stringsAsFactors = FALSE)
#remove null elements from list.
dfHtmlTableData <- dfHtmlTableData[!sapply(dfHtmlTableData, is.null)]

#convert data frame variables to appropriate classes.
dfHtmlTableData <- lapply(dfHtmlTableData, TypeConvertDataFrameList)

dfTables <- data.frame(url = rep(c(url), each = length(dfHtmlTableData)))

#Empty vector for JSON string data.
jsonStringsFactor <- rep(c(""), each = length(dfHtmlTableData))

if (length(dfHtmlTableData) > 0) {
    #Convert each HTML table to JSON.
    for (i in 1:length(dfHtmlTableData)) {
        jsonStringsFactor[i] <- jsonlite::toJSON(
      as.data.frame(dfHtmlTableData[[i]]), auto_unbox = TRUE)
    }
}

dfTables$HtmlTableAsJson <- jsonStringsFactor
OutputDataSet <- dfTables

OutputDataSet';

EXEC sp_execute_external_script 
 @language = N'R',
 @script = @RScript
WITH RESULT SETS (
 (
  [Url] VARCHAR(2048),
  HtmlTableAsJson VARCHAR(MAX)
 )
);
END
GO

Now we can retrieve a web page's HTML table data by invoking a stored procedure and passing in a URL string. We don't need to know how many HTML tables there are, and we don't need to know their schemas:

CREATE TABLE #HtmlTableData (
 ID INT IDENTITY,
 [Url] VARCHAR(2048),
 HtmlTableAsJson VARCHAR(MAX)
)

INSERT INTO #HtmlTableData
EXEC dbo.ScrapeHtmlTables @Url = N'https://itsalljustelectrons.blogspot.com/p/boston-celtics-2016-2017.html';

SELECT *
FROM #HtmlTableData;

Dave Mason - RStats - sp_execute_external_script


JSON

From here on out, parsing the returned data is a JSON exercise. We would, of course, need to know the JSON schema to query it in a tabular or meaningful way. Or would we?
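
If we did know the schema, a query for the roster data might look something like this (a sketch based on the property names visible in the sample output above):

DECLARE @RosterJson VARCHAR(MAX);
SELECT @RosterJson = d.HtmlTableAsJson
FROM #HtmlTableData d
WHERE d.ID = 1;

SELECT *
FROM OPENJSON(@RosterJson)
WITH (
	JerseyNumber INT '$."Jersey.Number"',
	Player VARCHAR(128) '$.Player',
	Position VARCHAR(8) '$.Position',
	Height VARCHAR(8) '$.Height',
	Weight INT '$.Weight',
	Birthdate VARCHAR(16) '$.Birthdate',
	YearsExperience VARCHAR(8) '$."Years.Experience"',
	College VARCHAR(128) '$.College'
);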

In another post, I described an automated way to infer JSON schema using dynamic T-SQL. Perhaps that article will pique your interest. Here's a look at what I had in mind.

DECLARE @Json VARCHAR(MAX);
SELECT @Json = d.HtmlTableAsJson
FROM #HtmlTableData d
WHERE d.ID = 2;

EXEC dbo.OpenJsonTabular @Json;

Dave Mason - JSON