Generating Fake Data in SQL Server With R

2020-02-01 2 Comments

SQL Server - sp_execute_external_script - R

I've been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals that aren't data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again and I wondered, can R be used to obfuscate PII data?

In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I'll focus on the "generator" package. It includes the following functions (whose names should be fairly self-explanatory) for generating fake data:

  • r_credit_card_numbers
  • r_date_of_births
  • r_email_addresses
  • r_full_names
  • r_ipv4_addresses
  • r_latitudes
  • r_longitudes
  • r_national_identification_numbers
  • r_phone_numbers

All of the functions have a required n parameter for the number of observations to generate. For dates of birth and phone numbers, there are some additional optional parameters for customization. Here's an example (using the functions I find most useful) via SQL Server's Machine Learning Service and sp_execute_external_script:

EXEC sp_execute_external_script 
 @language = N'R',
 @script = N'library(generator)
num_observations <- 100
df <- as.data.frame(r_full_names(num_observations))
df$phone <- r_phone_numbers(num_observations, use_hyphens = TRUE)
df$DOB <- as.character(r_date_of_births(num_observations))
df$email <- r_email_addresses(num_observations)
df$SSN <- r_national_identification_numbers(num_observations)
OutputDataSet <- df'
WITH RESULT SETS ((
 FullName VARCHAR(255),
 PhoneNumber VARCHAR(16),
 DOB DATE,
 EmailAddress VARCHAR(255),
 SSN CHAR(11)
));

R generator package


Dave's Thoughts

This presents a really interesting option, although there are many questions I haven't broached that are outside the scope of this post. Generating random dates and digits is fairly trivial with T-SQL. Generating random (and meaningful) strings is not. Those fake names are pretty cool, although if you are outside of the US, the diversity of those names (or lack of) may be a concern. The email addresses are not realistic in my opinion. They look too random, as you can see by looking at a few of the top-level domain name suffixes. (.lhw? .nxa?)

There's probably not enough functions in the generator package to roll your own complete PII obfuscation solution. However, it appears to be version 1.0. The documentation notes functions for home address and birthplace (among others) are "To be added". If those functions arrive any time soon, I'll be considering using "generator" for some non-production environments.

Are there any other R packages you know of for generating fake PII? Leave a message in the comments!

2 comments:

  1. Hi, it is very nice script but I am getting error
    Msg 39004, Level 16, State 20, Line 0
    A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
    Msg 39019, Level 16, State 1, Line 0
    An external script error occurred:
    Error in library(generator) : there is no package called 'generator'
    Calls: source -> withVisible -> eval -> eval -> library

    Error in ScaleR. Check the output for more information.
    Error in eval(expr, envir, enclos) :
    Error in ScaleR. Check the output for more information.
    Calls: source -> withVisible -> eval -> eval -> .Call
    Execution halted
    Msg 11536, Level 16, State 1, Line 1
    EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time.

    ReplyDelete
  2. Hi Uri,
    You'll need to install the "generator" package first. Check out my previous article for a walkthrough.

    ReplyDelete