An Alternative to BCP

2020-03-10 2 Comments

Dave Mason - rstats

BCP and OPENROWSET are long-lived SQL Server options for working with data in external files. I've blogged about OPENROWSET, including a recent article showing a way to deal with quoted data. One of the shortcomings I've never been able to overcome is an inconsistent data file with data fields in some rows enclosed in double quotes, but not all. I've never found a way around this limitation.

Let's demonstrate with BCP. Below is a sample data file I'll attempt to load into a SQL Server table. Note the data fields highlighted in yellow, which are enclosed in double quotes and contain the field terminator , (comma) character. For reference, the file is also available on Github.

name,phone,DOB,email,SSN,job,company
Stanley Lind,5819626239,2013-07-16,denhj@q.ila,241-63-2894,Chartered loss adjuster,"Mante, Mante and Mante"
Hale Bins,3212496148,1903-02-25,piuylnbstd@gxqshjlek.zfk,206-93-3199,Sports development officer,"Borer, Borer and Borer"
Mikaela O'Keefe,5245629478,1927-09-01,hcpykwmbli@o.qfn,159-33-2513,Museum/gallery conservator,"Jacobs, Jacobs and Jacobs"
Renaldo Kub,3175312457,1918-12-10,ualhdze@xktdvileb.pla,696-91-1436,Comptroller,Jacobs-Jacobs
Keagan Hackett,1374574537,1992-12-01,wicybnlpkr@pgnmy.myq,803-23-7144,Private music teacher,Quitzon LLC
Kiana Cronin m.d.,3297585362,2017-01-24,ctrwj@cpmwtrus.jnx,133-84-8085,Fisheries officer,Witting Ltd
Harlow Durgan,9239627469,1918-04-13,qzfgcrkhno@qijz.que,858-94-5148,"Nurse, mental health",Wilkinson PLC
Sheila Donnelly-Blanda,4657648749,2005-07-19,spmzor@kl.uej,737-58-1638,"Therapist, occupational",Barrows PLC
Mr Colin Turcotte,5634176974,1952-01-18,ksagn@csyarmhd.aev,946-84-5702,Maintenance engineer,Kub-Kub
Lesta Krajcik,6244527243,1921-04-20,b@wmjisa.jec,432-99-2126,Nutritional therapist,White-White

Here is the bcp command:

TITLE "%~n0"
ECHO OFF
COLOR 0E
CLS

bcp tempdb.guest.ImaginaryPeople in "C:\Output\ImaginaryPeople.csv" -S .\SQL2017 -T -c -t, -r "\n" -F 2

pause

And here are the results when we query the newly loaded table. BCP still includes the double quotes as we see on the first three rows, although the , (comma) character is intact and in the right place within the column/field data. However, lines 7 and 8 are wrong. The "job" data is split across two columns/fields. And there's a field terminator , (comma) character now within the "company".

Dave Mason - BCP


An Alternative

Now let's look at an alternative to BCP. Using SQL Server Machine Learning services, we can harness all of the flexibility the R language has to offer (or Python or Java, depending on which version of SQL Server you are using). The code below truncates the table and executes sp_execute_external_script, invoking the R function read.csv().

TRUNCATE TABLE tempdb.guest.ImaginaryPeople;
INSERT INTO tempdb.guest.ImaginaryPeople
EXEC sp_execute_external_script  
 @language = N'R',
 @script = N'  
csvFile <- read.csv(file = "C:\\Output\\ImaginaryPeople.csv", 
                    sep = ",", 
                    header = TRUE,
                    numerals = "no.loss")
csvFile$phone <- as.character(csvFile$phone)',
 @output_data_1_name = N'csvFile';

Here are the query results for the newly loaded table. The double quotes that previously enclosed data columns/fields are gone, all the , (comma) characters are where they should be, and all the field data are in the correct columns.

Dave Mason - sp_execute_external_script - read.csv


Dave's Thoughts

BCP and OPENROWSET have stood the test of time, and rightly so. They're great options. And while you might argue they're imperfect, I will be quick to remind you they are extremely fast. In situations where those tools may not work or are not preferable, there are certainly other options. (SSIS, PowerShell, and C# come to mind.) Maybe R, Python, or Java in-database are other options you'd consider? Comments welcome below. Happy coding!

2 comments:

  1. We have to deal with multi line fields a lot (carriage returns inside quoted text). How is the read.csv() function coping with these?

    ReplyDelete
  2. Thanks for sharing. I usually use bcp out to extract data not bcp in because of these parsing issues. The tip you demonstrated here will work also for bulk insert.
    Thanks.

    ReplyDelete