A TSQL Developer Contemplates CHAR(82)

2017-08-17 0 Comments

Dave Mason - R dplyr

I've been reading R Programming for Data Science and trying to make some sense of Big Data and how it fits into my world as a SQL Server professional. To be frank, it's been slow going. After three months, I'm not even halfway through the book, which isn't that long to begin with. In my defense, I have gone to three SQL Saturdays in that span, along with writing nine blog posts. I could have done worse.

I've considered blogging about #R a few times. It's not all that crazy, since I'm reading a book about it. The big issue is that I don't feel like I have much to say that isn't inaccurate, boring, or painfully awkward. I noted this in a tweet a few days ago:

I suppose it's time to break the seal. If you need a few shots to get through this, I won't be offended. So let's talk about R.

I think my first exposure to R was at PASS Summit 2016. Since then, I've made an effort to attend R sessions at SQL Saturdays. The one commonality I seem to find in all of them is a demo with (or mention of) the dplyr package. It's a package of functions that manipulate data in data frame objects (think of them as SQL Server/relational tables...or if you're a .NET developer, a System.Data.DataTable object). R feels inexorably tied to dplyr at this early stage for me. R is probably way more vast than I realize, but what would it be without dplyr? Would it still be as popular? Would it still be as powerful?

What's It Good For

I'm not sure if I'm perceiving this the right way yet, but dplyr sure feels a lot like LINQ, a .NET Framework technology that provides query-like capability for C#. For instance, you can select a subset of objects from an array, sort them, find a minimum or maximum, etc. It's kind of like querying SQL Server, just without SQL Server.

LINQ debuted with version 3.5 of the .NET Framework, which wasn't too long before I made the switch from developer to DBA. I never got the chance to use it much. But I remember how things were back in the VB6 days. Let's say I had an array of numbers, something like this:

How would I find the minimum? Looking back, I'd probably create a temp variable, store the first value in it (2), then loop through the rest of the numbers in the array. With each iteration, I'd compare the value of the temp variable to the current value in the loop. If the loop item was less than the temp variable, its value would be stored in the temp variable. When I got to the end of the loop, I'd have the smallest value of the array.

Conceptually, it's pretty easy. It's not that hard code-wise either. But what if I wanted to sort the array? Well that's a little more complicated. VB6 probably would have made me declare another 10-item array of numbers (all zeros). And then...you know what, let's not relive it. Let's just say it's a fair amount of work, shall we?

I was frustrated enough by the complexity of such a simple task that I wanted to write some dynamic TSQL, throw it at SQL Server, let it sort the numbers for me, and return them. (I solved lots of development problems with the TSQL hammer in those days.) Other developers were frustrated by it too. One of them showed me his trick of loading numbers into a combobox or listbox control and setting the "sorted" property to true. These were the kinds of things we did to avoid the ickyness of shuffling data around in memory.

With LINQ, you can "query" and sort your object data without all the programming gymnastics. And with dplyr, you can do much of the same with data frames. So far, I've read about select(), filter(), arrange(), mutate(), and group_by() functions. Those are all very T-SQL-ish. Maybe there will be more as I continue reading.

Thinking back to the R sessions from PASS Summit and SQL Saturdays, I get why all of the speakers included dplyr in their presentations. It makes total sense. I think.