2018-03-04

STRING_AGG Return Types

Dave Mason SQL Server STRING_AGG

I had a chance to use the STRING_AGG function recently, or so I thought. Since I couldn't use it for "real world" work, I decided to give it a test drive on my own. My initial thought was to do some performance testing and blog about that, but I ran into an interesting problem that I'll discuss here for now.

I started by populating a table with some "string" data:

SELECT TOP(100000)
 CAST(ROW_NUMBER() OVER (ORDER BY o.object_id) % 10 AS CHAR(1)) AS Digit,
 CAST(ROW_NUMBER() OVER (ORDER BY o.object_id) AS VARCHAR(16)) AS RowNumber,
 o.name AS ObjName
INTO tempdb.guest.Strings
FROM sys.objects o
CROSS JOIN sys.objects o2
CROSS JOIN sys.objects o3

My very first query attempt with STRING_AGG bombed:

SELECT STRING_AGG(s.digit, CHAR(13) + CHAR(10))
FROM tempdb.guest.Strings s

Msg 9829, Level 16, State 0, Line 5
STRING_AGG aggregation result exceeded the limit of 8000 bytes. Use LOB types to avoid result truncation.

OK, fair enough. I've got 100k rows, times 3 bytes per row. That's well beyond the 8k bytes threshold. But why is that threshold only 8k bytes? Microsoft Docs has some answers. The [digit] column that I pass in as the first argument is data type CHAR(1). It's not a unicode data type, so I'll assume the return type is the same as for VARCHAR. Casting the [digit] column to VARCHAR allows my query to run:

SELECT STRING_AGG(CAST(s.digit AS VARCHAR(MAX)), CHAR(13) + CHAR(10))
FROM tempdb.guest.Strings s

I guessed the data type conversion via CAST might slow things down. And it probably does to a certain extent. But the results come back fast. Really fast. Now, back to that issue with the 8k limit. Here's something else I tried that still didn't work:

SELECT TOP(100) STRING_AGG(s.digit, CHAR(13) + CHAR(10))
FROM tempdb.guest.Strings s

Msg 9829, Level 16, State 0, Line 5
STRING_AGG aggregation result exceeded the limit of 8000 bytes. Use LOB types to avoid result truncation.

100 rows times 3 bytes per row...that should work, right? If I look at the estimated execution plan, there's a table scan. The estimated number of rows is 100k, not 100. So I can see why the optimizer isn't cooperating.

Dave Mason SQL Server STRING_AGG Table Scan

This is a little bit of a mystery for me, but I'm not going to lose any sleep over it. (Please leave a comment if you know something that I don't.) As shown above, using CAST to VARCHAR(MAX) works. Even though the performance of STRING_AGG appears really fast, I'm hoping to do some further testing to compare its performance to the old way of concatenating strings from a result set. Stay tuned.


SHARE

1 comment:

  1. STRING_AGG is an aggregate function, and the TOP operator is "outside" the function, so it will do the aggregate first, then the TOP second. The TOP will limit the number of rows returned, not the number of rows aggregated.

    Think of a similar query with SELECT TOP 10 SUM(col2)...GROUP BY col1 ... The TOP will give you 10 values for col1, and an otherwise unrestricted sum.

    You could use a CTE to limit to 10, then a STRING_AGG on the CTE.

    ReplyDelete