2017-10-02

Regular Expressions With R And T-SQL

Dave Mason SQL Server R RegEx

Have you ever had the need to use Regular Expressions directly in SQL Server? I sometimes hear or see others refer to using RegEx in TSQL. But I always assume they're talking about the TSQL LIKE operator, because RegEx isn't natively supported. In TSQL's defence, you can get a lot of mileage out of LIKE and some clever pattern matching strings, even though it's not authentic RegEx. You can leverage RegEx libraries in the .NET Framework via a CLR stored procedure. You should also be able to do something similar with an old-school extended stored procedure.

I discussed all of this during a recent interview. It was a day or two afterwards (of course) when it dawned on me that there's another way to leverage RegEx from TSQL: the R language. Prior to this mini-revelation, I had always thought of R (and Python) as strictly a means to an end for Data Science and related disciplines. Now I am thinking I've been looking at R and Python through too narrow of a lens and I should take a larger view.

RegEx In The Real World

Coming up with a good example that you can't already do with TSQL was a challenge. In fact, I'd say I failed. I remember working on an ETL project once where there was some data validation that I performed with TSQL and dozens of LIKE clauses. It worked for nearly 100% of the data, but not all of it: I would have invoked RegEx from TSQL if it was an option. I can't remember if it was for email addresses or physical mailing addresses. For this example I'll use email addresses.

It turns out getting a fool-proof RegEx string search pattern for validating email addresses is damn complicated. To keep things simple, I nabbed a couple from two StackOverflow questions. (They're highlighted in yellow.) I'll query some "people" via the SQL Server database engine, pass the data for those people to R, let R decide which people have invalid email addresses, and have R pass back that subset of people to SQL Server. The "people" are from the [Application].[People] table in the [WideWorldImporters] sample database. They get passed to the R engine as a dataframe named "InputDataSet". R uses the grepl function with the "not" operator (exclamation point!) to find which people have email addresses that don't match the RegEx string search pattern.

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N' RegexWithR <- InputDataSet;
OutputDataSet <- RegexWithR[!grepl("^[[:alnum:].-_]+@[[:alnum:].-]+$", RegexWithR$EmailAddress), ];',
 @input_data_1 = N'SELECT PersonID, FULLName, EmailAddress FROM Application.People'
 WITH RESULT SETS (([PersonID] INT, [FullName] NVARCHAR(50), [EmailAddress] NVARCHAR(256)))

This example returns 46 rows. If I'm understanding the RegEx pattern, it doesn't like email addresses with spaces, hyphens, or unicode characters.


Here's another example that returns fewer rows. This time it looks like it's just email addresses with the spaces and hyphens that don't meet the RegEx criteria.

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N' RegexWithR <- InputDataSet;
OutputDataSet <- RegexWithR[!grepl("([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))", RegexWithR$EmailAddress), ];',
 @input_data_1 = N'SELECT PersonID, FullName, EmailAddress FROM Application.People'
 WITH RESULT SETS (([PersonID] INT, [FullName] NVARCHAR(50), [EmailAddress] NVARCHAR(256)))


As I had mentioned, you could write TSQL that accomplishes the same R examples I've shown here. But as the searches get more complex, the TSQL code gets progressively larger, while the RegEx code requires nothing more than changing the search string expression. There's a threshold with TSQL where it makes more sense to work with RegEx. Granted, that threshold is different depending on the developer.

Then there's the option of using a CLR function. It has some obstacles to overcome in the SQL Server environment, just as using R does. There are some pros and cons with each approach. Let's not say one or the other is better or worse. Let's simply recognize the different approaches.

Data science still feels a world away. But I've finally found a tangible example of something I could use R for. I hope to find some other uses soon. Watch out, world! I think I just earned my R training wheels.


SHARE

No comments :

Post a Comment

Subscribe