Fuzzy strings matching using Levenshtein algorithm on SQL Server (T-SQL vs CLR)

Levenshtein algorithm is one of possible fuzzy strings matching algorithm. Levenshtein algorithm calculates Levenshtein distance which is a metric for measuring a difference between two strings. The Levenshtein distance is also called an edit distance and it defines minimum single character edits (insert/updates/deletes) needed to transform one string to another. Details on the algorithm itself can be found on Wikipedia.
When you need to use it in queries, functions or stored procedures you have two possibilities – T-SQL implementation and CLR implementation.
I will show both solutions here and also compare the speed of both solutions.

T-SQL implementation of Levenshtein algorithm

For T-SQL I will took the one I have found on the SqlTeam.com forums a which was originally developed by Joseph Gama as mentioned in the SqlTeam post.

CREATE FUNCTION [dbo].[edit_distance](
  @s1 nvarchar(3999),
  @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  DECLARE @s1_len int, @s2_len int
  DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
  DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
  SELECT
    @s1_len = LEN(@s1),
    @s2_len = LEN(@s2),
    @cv1 = 0x0000,
    @j = 1, @i = 1, @c = 0

  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len

  BEGIN
    SELECT
      @s1_char = SUBSTRING(@s1, @i, 1),
      @c = @i,
      @cv0 = CAST(@i AS binary(2)),
      @j = 1

    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT
      @cv1 = @cv0,
      @i = @i + 1
  END

  RETURN @c
END

CLR implementation of Levenshtein algorithm

public class FuzzyStrings{
  /// <summary>
  /// Calculates the Levenshtein Distance between two strings.
  /// It is minimum of single character insert/delete/update operations needed to transfrom
  /// first string into the second string
  /// </summary>
  /// <param name="firstString">First string to calculate the distance</param>
  /// <param name="secondString">Second string to calculate the distance</param>
  /// <param name="ignoreCase">Specifies whether to ignore case in comparison</param>
  /// <returns>int represending the Levenshtein Distance</returns>
  public static int LevenshteinDistance(SqlString firstString, SqlString secondString, SqlBoolean ignoreCase)
  {
    string strF = ignoreCase ? firstString.Value.ToLower() : firstString.Value;
    string strS = ignoreCase ? secondString.Value.ToLower() : secondString.Value;
    int lenF = strF.Length;
    int lenS = strS.Length;
    int[,] d = new int[lenF + 1, lenS + 1];

    for (int i = 0; i <= lenF; i++)
      d[i, 0] = i;
    for (int j = 0; j <= lenS; j++)
      d[0, j] = j;

    for (int j = 1; j <= lenS; j++)
    {
      for (int i = 1; i <= lenF; i++)
      {
        if (strF[i - 1] == strS[j - 1])
          d[i, j] = d[i - 1, j - 1];
        else
          d[i, j] = Math.Min(Math.Min(
            d[i - 1, j] + 1,        // a deletion
            d[i, j - 1] + 1),       //an Insertion
            d[i - 1, j - 1] + 1);   // a substitution
      }
    }

    return d[lenF, lenS];
  }
}

You need to compile the code using e.g. C# Express into a an assembly and create the assembly in DB. In my Case the assembly is named [SQLCLR].

CREATE FUNCTION [dbo].[fn_LevenshteinDistance](
  @firstString [nvarchar](4000),
  @secondString [nvarchar](4000),
  @ingoreCase [bit] = 1
)
RETURNS [int]
WITH EXECUTE AS CALLER
AS EXTERNAL NAME [CLRSQL].[FuzzyStrings].[LevenshteinDistance]
GO

Testing of the functions

Once we have properly created both functions we can start testing it. Here is a script for test of both functions on several strings.

SELECT
  dbo.edit_distance('Sunday', 'Monday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'Monday', 1) AS CLRDistance

UNION ALL 

SELECT
  dbo.edit_distance('Sunday', 'Sunday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'Sunday', 0) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('Sunday', 'sunday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'sunday', 0) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('Saturday', 'Monday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Saturday', 'Monday', 1) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('This is a first string to Compare', 'This is a second string to Compare') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('This is a first string to Compare', 'This is a second string to Compare', 1) AS CLRDistance

And here are the results of the tests:

TSQLDistance CLRDistance
------------ -----------
2            2
0            0
0            1
5            5
6            6

As we can see, the functions return the same results except the second select. As we have selected not to ignore the case in the CLR version. For T-SQL strings are identical as the function uses a default collation which I have Case Insensitive. If I had a default collation Case Sensitive, then the results will be the same. It could be also possible to modify the T-SQL function to accept parameter for Case Sensitive/Insensitive comparison and then use different collations for that, but it’s not what we want to do here.

Speed comparison

As we saw in previous paragraphs here, both T-SQL and CLR version of the algorithm woks correctly. Now take a look on the calculation speed of the Levenshtein distance by both version.

For the test we can use a simple script, which will calculate the Levenshtein distance in cycle 10 000 times

DECLARE
   @TSQLStartTime datetime,
  @TSQLEndTime datetime,
  @CLRStartTime datetime,
  @CLREndTime datetime,
  @distance int,
  @i int

SELECT
  @i = 0,
  @TSQLStartTime = GETDATE();

WHILE (@i < 10000)
BEGIN
  SELECT
    @distance = dbo.edit_distance('This is a first string to Compare', 'This is a second string to compare'),
    @i = @i + 1
END

SELECT
  @TSQLEndTime = GETDATE(),
  @i = 0,
  @CLRStartTime = GETDATE()

WHILE (@i < 10000)
BEGIN
  SELECT
    @distance = [ClrSafe].fn_LevenshteinDistance('This is a first string to Compare', 'This is a second string to compare', 1),
    @i = @i + 1
END

SELECT @CLREndTime = GETDATE()

SELECT
  DATEDIFF(millisecond, @TSQLStartTime, @TSQLEndTime) AS TSQLDuration,
  DATEDIFF(millisecond, @CLRStartTime, @CLREndTime) AS CLRDuration

Here are the results

TSQLDuration CLRDuration
------------ -----------
22993        763

As we can see the 10 000 times calculation using the T-SQL version took 22993 milliseconds which is in average circa 2.3 millisecond for calculating the distance for the strings in our test query.

On the other side the 10 000 times calculation using CLR took only 763 milliseconds which is in average circa 0.08 milliseconds for calculating the distance for the same strings as in T-SQL version.

Conclusion

From the results we can see that the CLR is about 30 times faster on the same machine than the T-SQL version of the same algorithm. Even the T-SQL version took only 2.3 milliseconds per calculation of sample texts and it’s quite good to use in normal usage, the use or CLR can enormously decrease the processing time when processing higher volume of records.

Also it is another example where CLR beats the T-SQL solution and where it has sense to use the CLR instead of pure T-SQL.

Advertisements

Fastest CSV strings splitting using CLR (T-SQL vs. CLR revisited)

In one my previous blog post SQL Server – String splitting (T-SQL vs. CLR) I was comparing speed of T-SQL solution for string splitting vs. CLR RegEx solution. Although the CLR RegEx is fast enough, it isn’t the fastest solution for simple CSV string splitting. Also I will again compare it to the T-SQL solution.

In the mentioned post the T-SQL solution wasn’t usable for larger amount of CSV values, but after some investigations by Jeff Moden in post http://www.sqlservercentral.com/Forums/FindPost997236.aspx found a mistake we both made in the T-SQL Testing, and therefore I will post here also updated comparison to the T-SQL version

Fastest CLR version

Probably fastest CLR version for splitting sting is a CLR table-valued function which processes the whole string on character level and on a delimiter sends the results.

Here is one possible solution.

public class StringSplit
{
    private struct StrRow
    {
        public StrRow(int rowId, SqlChars value)
        {
            RowId = rowId;
            Value = value;
        }

        public int RowId;
        public SqlChars Value;

    }

    [SqlFunction(FillRowMethodName = "FillSplitString3")]
    public static IEnumerable SplitString3(SqlString sourceString, string delimiter, int maxLen)
    {
        char[] buffer = new char[maxLen];
        char delim = delimiter[0];
        int rowNumber = 0;
        int chars = 0;
        char[] finalString;

        foreach (char chr in sourceString.Value)
        {
            if (chr == delim)
            {
                finalString = new char[chars];
                Array.Copy(buffer, finalString, chars);
                yield return new StrRow(++rowNumber, new SqlChars(finalString));
                chars = 0;
            }
            else
            {
                buffer[chars++] = chr;
            }
        }
        if (chars > 0)
        {
            finalString = new char[chars];
            Array.Copy(buffer, finalString, chars);
            yield return new StrRow(++rowNumber, new SqlChars(finalString));
        }

    }

    [SqlFunction(FillRowMethodName = "FillSplitString3")]
    public static IEnumerable SplitString4(SqlString sourceString, string delimiter)
    {
        StringBuilder sb = new StringBuilder();
        char delim = delimiter[0];
        int rowNumber = 0;
        foreach (char chr in sourceString.Value)
        {
            if (chr == delim)
            {
                yield return new StrRow(++rowNumber, new SqlChars(sb.ToString()));
                sb = new StringBuilder(sb.Capacity);
            }
            else
            {
                sb.Append(chr);
            }
        }
        if (sb.Length > 0)
        {
            yield return new StrRow(++rowNumber, new SqlChars(sb.ToString()));
        }

    }

    public static void FillSplitString3(object obj, out int rowId, out SqlChars value)
    {
        StrRow r = (StrRow)obj;
        rowId = r.RowId;
        value = r.Value;
    }
}
CREATE FUNCTION dbo.fn_SplitString3(
  @sourceString nvarchar(max),
  @delimiter nchar(1),
  @maxLen int
)
RETURNS  TABLE (
    RowID int NULL,
    Value nvarchar(10) NULL
) WITH EXECUTE AS CALLER
AS
EXTERNAL NAME SQLRegEx.StringSplit.SplitString3
GO

This function takes three parameters. First the source string to be split, delimiter and maxLen, which is maximum length for an item in the CSV List. It is used to allocate buffer. And e.g.. for integer values it will be 10 as positive integer will have maximum of 10 digits. It is possible to write this function also without this parameter, but I’ve added it because of speed, as it doesn’t require buffer reallocations.

I will compare the speed also to the CLR RegEx version. I will use the function mentioned in my previous post.

For CLR RegEx we will use a simple Regular expression ”d+” as it is enough for the integer values delimited by commas.

As T-SQL candidate for speed comparison I will use the latest optimized version of Tally table splitting by Jeff Moden.

CREATE FUNCTION dbo.Split8KTallyM (
    @Parameter VARCHAR(8000),
    @Delimiter VARCHAR(1)
)
RETURNS @Result TABLE (ItemNumber INT, ItemValue INT) AS
  BEGIN
 INSERT INTO @Result
        (ItemNumber, ItemValue)
 SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
        SUBSTRING(@Parameter,N,CHARINDEX(@Delimiter,@Parameter+@Delimiter,N)-N) AS ItemValue
   FROM dbo.Tally
  WHERE N BETWEEN 1 AND LEN(@Parameter)+1
    AND SUBSTRING(@Delimiter+@Parameter,N,1) = @Delimiter; --Notice how we find the comma
 RETURN
    END;
GO

Test data preparation

I will use as test data the same tables as in previous tests. We will use table with 10 000 rows and each will be with different length of CSV string (16 items, 100 items and 1333 items). The table definition will be only modified and the string will not be stored as nvarchar(max) but as varchar(max). The nvarchar in previous test totally degraded the T-SQL solution so it was not usable for 1333 item in SCV string.

SELECT TOP 11000
    IDENTITY(INT, 1, 1) AS N
INTO dbo.Tally
FROM sys.all_objects o1, sys.all_objects
GO

--Add Clustered Index on Tally table
ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
GO

--Create and populate CsvTest table (doesn't matter whether the table has Clustered index or it is simply heap)
SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (16) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up 121 million.
GO

SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (100) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest2
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up 121 million.
GO

SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (1333) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest3
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up 121 million.
GO

Speed comparison

Here is a script I will use to compare the speed:

--================= 16 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, 'd+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO
--================= 100 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest2 D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest2 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, 'd+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest2 D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO
--================= 1333 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest3 D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest3 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, 'd+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest3 D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO

And here are the results from profiler:

Profiler results

Results of comparison and conclusion

As we can see in the output from profiles, the new fn_SplitString3 function is unbeatable in all scenarios. While the T-SQL took 3.5 seconds for 16 items, the new CLR split function takes only 253 milliseconds. As mentioned in previous post, the CLR RegEx benefits at higher items count over 100. And in higher counts beats the T-SQL Solutions. The new fn_SplitString even on 1333 items count took only 8.2 sec.

Calculating running total for last X rows – UPDATED

This is another blog post about running totals and again I will use CLR and demonstrate the power of CLR in such situations. Again this post is inspired by an article on SQL Server Central – Calculate the Running Total for the last five Transactions by Divya Agrawal.

This is a kind of running totals when we want to sum only a specific number of last rows.

The CLR function will be nothing than a slightly modification of the running totals function from my previous blog post: SQL Sever and fastest running totals using CLR.

I’ve updated this blog post based on the update of mentioned previous post, after communication with Paul White and added the security check into the CLR function to detect processing of rows out of expected order. Please read the previous post for more details.

Test data preparation

I will use sample data from the article, so it is possible to compare the solutions and we will be able to see the difference.

CREATE TABLE Accounts
(
    ID int IDENTITY(1,1),
    TransactionDate datetime,
    Balance float
)
GO

INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/1/2000',100)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/2/2000',101)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/3/2000',102)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/4/2000',103)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/5/2000',104)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/6/2000',105)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/7/2000',106)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/8/2000',107)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/9/2000',108)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/10/2000',109)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/11/2000',200)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/12/2000',201)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/13/2000',202)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/14/2000',203)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/15/2000',204)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/16/2000',205)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/17/2000',206)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/18/2000',207)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/19/2000',208)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/20/2000',209)
GO

CLR Solution

As I mentioned above the CLR solution is only a modification of function from previous blog post. We only add a queue of last calculated running totals.

Calculation of Running Total in Excel

The image above displays situation when we want to make running total for last 3 rows. Our solution will calculate the running total continuously as it is done in column B, but when the requested count of 3 rows is met, it will subtract the running total 3 rows back as can be seen on the image in column B. For this we need in our function a queue of the last X running total values (In this example 3). This class also implements the security check introduced in my previously updated article related to Running Totals. The security check consist in providing the correct row numbers in the sequence in which the records should be processed and in case the records are processed out of expected order, an exception is fired.

using System;
using Microsoft.SqlServer.Server;
using System.Runtime.Remoting.Messaging;
using System.Data.SqlTypes;
using System.Collections.Generic;

/// <summary>
/// Class contains CLR scalar functions for calculation of running totals
/// </summary>
public class RunningTotalsQueue
{
    /// <summary>
    /// Storage Structure for holding actual Total and row number for security check.
    /// </summary>
    /// <typeparam name="T">Totals Data Type</typeparam>
    private class RtStorage<T> where T : struct
    {
        public T Total;
        public int RowNo;
        public Queue<T> Queue;
        public RtStorage(int queueLength)
        {
            Queue = new Queue<T>(queueLength);
        }
    }

    /// <summary>
    /// Calculates a running totals on Int (Int32) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="queueLength">Length of the queue. 0 for classical running totals</param>
    /// <param name="rowNo">Specifies expected rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <param name="nullForLessRows">Specifies whether return NULL if less values than queue are summed</param>
    /// <returns>SqlInt64 representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlInt64 RunningTotalBigIntQueue(SqlInt64 val, SqlByte id, int queueLength, int rowNo, SqlInt64 nullValue, bool nullForLessRows)
    {
        string dataName = string.Format("MultiSqlRtQueue_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        RtStorage<SqlInt64> storage;
        if (lastSum != null)
            storage = (RtStorage<SqlInt64>)lastSum;
        else
            CallContext.SetData(dataName, storage = new RtStorage<SqlInt64>(queueLength));

        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        var currentTotal = storage.Total;
        if (queueLength > 0 && storage.Queue.Count == queueLength)
        {
            var lastQueue = storage.Queue.Dequeue();
            currentTotal -= lastQueue.IsNull ? 0 : lastQueue;
        }
        else if (storage.Queue.Count < queueLength && nullForLessRows)
            currentTotal = SqlInt64.Null;

        if (queueLength > 0)
            storage.Queue.Enqueue(storage.Total);

        return currentTotal;
    }

}

On first call the function allocates a queue of requested queue size inside a private storage class and stores it using the CallContext for use by future calls. The function also allows returning NULL when the count of rows processed is lower than the number of rows requested to be calculated in the running total and also allows calculation of classical running totals (not queued) in case the queue length is equal to 0.

CREATE FUNCTION [dbo].[fn_RunningTotalBigIntQueue](
	@val [bigint],                  --value to be added to running total (a fiedl in the query)
	@id [tinyint],                  --id of the running total within a single query
	@queueLength [int],             --lenght of the queue
	@rowNo [int],                   --RowNumber of processed records. This is compared to espected rowNo and in case out of synchronization an exceiption is fired
	@nullValue [bigint] = NULL,     --representation of the NULL value when adding to running totals
	@nullForLesRows [bit] = 0       --specifies whether return NULL if less records than queueLenght were processed
)
RETURNS [bigint]
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [SqlClrTotals].[RunningTotalsQueue].[RunningTotalBigIntQueue]
GO

Tests

Once we have compiled the sources, created assembly and function in DB, we can make a tests. First we will test the T-SQL solution from the article mentioned at the beginning of this blog post and then the presented CLR solution.

T-SQL solution

SET STATISTICS IO ON
GO
SELECT
    Acc.ID,CONVERT(varchar(10),TransactionDate,101) AS TransactionDate ,
    Balance,
    isnull(RunningTotal,'') AS RunningTotal
FROM Accounts Acc
LEFT OUTER JOIN (SELECT ID,sum(Balance) AS RunningTotal
        FROM (SELECT A.ID AS ID,B.ID AS BID, B.Balance
            FROM Accounts A
                cross JOIN Accounts B
            WHERE B.ID BETWEEN A.ID-4
            AND A.ID AND A.ID>4
            ) T
        GROUP BY ID ) Bal
    ON Acc.ID=Bal.ID
GO
SET STATISTICS IO OFF
GO

Results:

ID          TransactionDate Balance                RunningTotal
----------- --------------- ---------------------- ----------------------
1           01/01/2000      100                    0
2           01/02/2000      101                    0
3           01/03/2000      102                    0
4           01/04/2000      103                    0
5           01/05/2000      104                    510
6           01/06/2000      105                    515
7           01/07/2000      106                    520
8           01/08/2000      107                    525
9           01/09/2000      108                    530
10          01/10/2000      109                    535
11          01/11/2000      200                    630
12          01/12/2000      201                    725
13          01/13/2000      202                    820
14          01/14/2000      203                    915
15          01/15/2000      204                    1010
16          01/16/2000      205                    1015
17          01/17/2000      206                    1020
18          01/18/2000      207                    1025
19          01/19/2000      208                    1030
20          01/20/2000      209                    1035 

(20 row(s) affected) 

Table 'Accounts'. Scan count 37, logical reads 37, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

We cab see, that for only a few records and the T-SQL solution produced 37 logical reads and the query plan is quite complex. And there is a lot of table scans.

If we add a clustered primary key on the ID

ALTER TABLE dbo.Accounts ADD CONSTRAINT PK_Accounts PRIMARY KEY CLUSTERED(ID)

Then the plan will be better as we will replace the three table scans in the query plan by one clustered index scan and two clustered index seeks. But even now the plan is quite complex.

CLR Solution test

SET STATISTICS IO ON
GO
SELECT
    Acc.ID
    ,CONVERT(varchar(10),TransactionDate,101) AS TransactionDate
    ,Balance
    ,ISNULL(dbo.fn_RunningTotalBigIntQueue(Balance, 0, 5, ROW_NUMBER() OVER(ORDER BY ID), NULL, 1), 0) AS RunningTotal
 FROM Accounts Acc
 ORDER BY ID
 OPTION(MAXDOP 1)
GO
SET STATISTICS IO OFF
GO

Here we have results:

ID          TransactionDate Balance                RunningTotal
----------- --------------- ---------------------- ----------------------
1           01/01/2000      100                    0
2           01/02/2000      101                    0
3           01/03/2000      102                    0
4           01/04/2000      103                    0
5           01/05/2000      104                    510
6           01/06/2000      105                    515
7           01/07/2000      106                    520
8           01/08/2000      107                    525
9           01/09/2000      108                    530
10          01/10/2000      109                    535
11          01/11/2000      200                    630
12          01/12/2000      201                    725
13          01/13/2000      202                    820
14          01/14/2000      203                    915
15          01/15/2000      204                    1010
16          01/16/2000      205                    1015
17          01/17/2000      206                    1020
18          01/18/2000      207                    1025
19          01/19/2000      208                    1030
20          01/20/2000      209                    1035 

(20 row(s) affected) 

Table 'Accounts'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Here we can see, that the CLR solution produced exactly the same output as the T-SQL solution. What more, we have only 1 logical read as the table is small and we have also received a very simple execution plan with single Table Scan and Sort. The segment and Sequence Project(Compute Scalar) operators are the ROW_NUMBER() function for security check.  The compute Scalar is our function computing the running totals.

In case of the Clustered Primary Key the plan will contain one Clustered Index scan without sort as the Clustered Key is already in our expected order.

Conclusion

As we can see from the simple test here, the CLR solution is unbeatable for such purposes. Even if we use other T-SQL solution like “quirky updates”, even then the CLR will have less reads and should be quicker.

When increasing the number of records and increasing the last X number, then with higher number count, the T-SQL solution will kill even high end servers. On the other side you will see nearly no difference in speed when using the CLR solution. As it only needs to allocate one queue of last X elements. Even if we want to count last 1000 000 rows, then it only needs to allocate such queue and in case of bigint it will 8 MB + some overhead of the queue.

Of course the CLR solution has the issues with processing rows in right order, but for that purposes there is the security check implemented so in case rows are processed in other than expected order an exception is thrown. But in most cases the scenario is very simple and you need to calculate the total from a simple table and in such cases this solution work without bigger problems. In case of complex queries with JOINS or other windowing functions, records can be processed into a temp table and then the running totals calculated on the temporary table.

SQL Server and fastest running totals using CLR – Updated

Calculating running totals is a common task when reporting warehouse or financial balance. The theory behind running totals is quite easy as it is simply sum of values on all previous rows.

RowID ProductID Date Operation Quantity Total Quantity
1 1 2010/09/01 Buy 1000 1000
2 1 2010/09/02 Buy 500 1500
3 1 2010/09/03 Sell 200 1300
4 1 2010/09/04 Buy 500 1800
5 1 2010/09/05 Sell 1000 800

Some reporting tools provide native support for running total, but some tools not. The real problems comes when we need to calculate running totals inside query. The main problems are with performance and sometimes complexity of such solution.

Jeff Moden has a great article about calculating running totals using T-SQL on SQL Server Central: Solving the “Running Total” & “Ordinal Rank” Problems.

For relatively small record sets the T-SQL solution is quite usable, but for larger record sets the real performance problems comes (a lot of reads and processing).

Although there are many articles related to running totals, nearly non of them describes the fastest possible method for calculating running totals – CLR. On the other side the CLR solution brings some issues which we have to take in mind.

I have updated the article after some comments and further email communication with Paul White MVP (blog). Paul has mentioned issues with rows ordering for calculation of the running totals so this updated version reflect this and also I’ve added a security check to the functions so those functions throws exceptions when and issue with row ordering is detected.

CLR solution for running totals

As I mentioned above, the CLR solution is the simplest and quickest solution for the running totals. Before we start with the CLR solution we will prepare a testing database and test data.

--======================
-- Create test database
--======================
CREATE DATABASE RunningTotalsTest
GO
USE RunningTotalsTest
GO 

--===========================
-- Create small test table
--===========================
CREATE TABLE SmallTable (
    RowID int NOT NULL IDENTITY(1,1),
    Value int, RandomValue int
)
GO 

--Add Clustered primary key to the table
ALTER TABLE SmallTable
    ADD CONSTRAINT PK_SmallTable
    PRIMARY KEY CLUSTERED (RowID)
    WITH FILLFACTOR = 100
GO 

--===========================
-- Create big test table
--===========================

--The big table will be heap, becase we will want to test paralel query plans on it
CREATE TABLE BigTable (
    RowID int NOT NULL IDENTITY(1,1),
    Value int,
    RandomValue int,
    BigData binary(3000)
)
GO 

--==============================
-- Fill tables with test data
--============================== 

--fill small table with data 1000 records
WITH TallyBase AS (
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N
),
Tally AS (
    SELECT
        ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) As Value
    FROM TallyBase t1, TallyBase t2, TallyBase t3
)
INSERT INTO SmallTable (
    Value,
    RandomValue
)
SELECT
    Value,
    ABS(CHECKSUM(NEWID())) % 100000
FROM Tally
GO 

--fill BigTable with 100000 records
WITH TallyBase AS (
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N UNION ALL
    SELECT 1 AS N
),
Tally AS (
    SELECT
        ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) As Value
    FROM TallyBase t1, TallyBase t2, TallyBase t3, TallyBase t4, TallyBase t5
)
INSERT INTO BigTable (
    Value,
    RandomValue
)
SELECT
    Value,
    ABS(CHECKSUM(NEWID())) % 100000
FROM Tally
GO

Simple scalar CLR running totals function

In our implementation we will use a simple scalar CLR function. This function will store current value for next row and in that row it will retrieve the value from previous row and calculate total. For this purpose we will use CallContext class (See MSDN for details).

Here is testing class and function for BigInt data type.

public class RunningTotals
{
    /// <summary>
    /// Storage Structure for holding actual Total and row number for security check.
    /// </summary>
    /// <typeparam name="T">Totals Data Type</typeparam>
    private struct RtStorage<T> where T : struct
    {
        public T Total;
        public int RowNo;
    }

    /// <summary>
    /// Calculates a running totals on BigInt (Int64) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlInt64 representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlInt64 RunningTotalBigInt(SqlInt64 val, SqlByte id, int rowNo, SqlInt64 nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlInt64>)lastSum : new RtStorage<SqlInt64>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }
}

The function has several input parameters:

  • val parameter represents the value to be accumulated into the running total.
  • id parameter is identification of the function in a select statement if there are more functions used in single select (eg. accumulating costs and incomes separately) – it is necessary to have separate storage for each call of the function.
  • rowNo is parameter for the security check to ensure, that rows are accumulated in expected order and correct results are provided. In case an issue with accumulation order is detected, an exception is thrown so user immediately knows the calculation is wrong.  The security check is based on communication with Paul White (MVP) as mentioned above and the value for this parameter will be provided by ROW_NUMBER() windowing function with expected order.
  • nullValue allows specify value for null handling (eg. you may want to accumulate some particular value in case the val is null.

Once we compile the assembly with the class, we can register it in our test database. Because we use the CallContext class, the assembly will require PERMISSION_SET = UNSAFE and because of this we have to set also the database TRUSTWORTHY to ON;

--=================================================================
-- Set Testing database TRUSTWORTHY ON so we can use UNSAFE code
--=================================================================
ALTER DATABASE [RunningTotalsTest]
SET TRUSTWORTHY ON WITH ROLLBACK IMMEDIATE
GO 

--==========================================
-- Create assembly
--==========================================
CREATE ASSEMBLY [SqlClrTotals]
AUTHORIZATION [dbo]
FROM 'C:SQLCLRSqlClrTotals.dll'
WITH PERMISSION_SET = UNSAFE
GO 

--==========================================
-- Create testing function
--==========================================
CREATE FUNCTION [dbo].[fn_RunningTotalBigInt](
	@val [bigint],					--value to be added to running total (a fiedl in the query)
	@id [tinyint],					--id of the running total within a single query
	@rowNo [int],					--RowNumber of processed records. This is compared to espected rowNo and in case out of synchronization an exceiption is fired
	@nullValue [bigint] = NULL		--representation of the NULL value when adding to running totals
)
RETURNS [bigint]
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [SqlClrTotals].[RunningTotals].[RunningTotalBigInt]
GO

Once the assembly and function is created, we can start testing the function on our test tables.

Testing running total function on queries with serial plan

Fist test will be done on simple queries which will have a simple serial plan.

Simple query with one running total

SET STATISTICS IO ON
SET STATISTICS TIME ON
GO 

SELECT
    RowID,
    Value,
    dbo.fn_RunningTotalBigInt(Value, 0, ROW_NUMBER() OVER(ORDER BY RowID), null) as RunningTotal
FROM dbo.SmallTable
ORDER BY RowID

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO

Please note the ROW_NUMBER() OVER(ORDER BY RowID). This is the security check by which we provide RowNumber in the order of RowID. In case of some inconsistencies (rows arrive in different order as required. This can happen because the query engine doesn’t guarantee the rows order until final result set is received) an exception will be thrown.

As we can see, the syntax for this query is very simple and it also give as a very good and correct results:

RowID       Value       RunningTotal
----------- ----------- --------------------
1           1           1
2           2           3
3           3           6
4           4           10
5           5           15
6           6           21
.           .           .
.           .           .
.           .           .
998         998         498501
999         999         499500
1000        1000        500500 

(1000 row(s) affected) 

SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms. 

(1000 row(s) affected)
Table 'SmallTable'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

(1 row(s) affected)
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 3 ms.

And also a very simple query plan:

Non of the T-SQL methods will produce such simple query plan. The Sequence Project is calculation of the ROW_NUMBER() for security check.

Multiple running totals in single query

Now let test the situation if we use the function two times in the same query.

SELECT
    RowID,
    Value,
    [dbo].fn_RunningTotalBigInt(Value, 0, ROW_NUMBER() OVER(ORDER BY RowID), null) as RunningTotal,
    RandomValue,
    [dbo].fn_RunningTotalBigInt(RandomValue, 1, ROW_NUMBER() OVER(ORDER BY RowID), null) as RandomRunningTotal
FROM dbo.SmallTable
ORDER BY RowID

Result:

RowID       Value       RunningTotal         RandomValue RandomRunningTotal
----------- ----------- -------------------- ----------- --------------------
1           1           1                    15137       15137
2           2           3                    97303       112440
3           3           6                    14688       127128
4           4           10                   64512       191640
5           5           15                   20463       212103
.           .           .                    .           .
.           .           .                    .           .
996         996         496506               64071       50040800
997         997         497503               58488       50099288
998         998         498501               65331       50164619
999         999         499500               97196       50261815
1000        1000        500500               65433       50327248

As we can see, we have again correct results. Note, that each call to the fn_RunningTotalInt has different ID.

In case we use same ID the execution will fail with detection of rows being processed in unexpected order.

SELECT
    RowID,
    Value,
    [dbo].fn_RunningTotalBigInt(Value, 0, ROW_NUMBER() OVER(ORDER BY RowID), null) as RunningTotal,
    RandomValue,
    [dbo].fn_RunningTotalBigInt(RandomValue, 0, ROW_NUMBER() OVER(ORDER BY RowID), null) as RandomRunningTotal
FROM dbo.SmallTable
ORDER BY RowID
Msg 6522, Level 16, State 2, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "fn_RunningTotalInt":
System.InvalidOperationException: Rows were processed out of expected order. Expected RowNo: 2, received RowNo: 1
System.InvalidOperationException:
 in RunningTotals.RunningTotalInt(SqlInt32 val, SqlByte id, Int32 rowNo, SqlInt32 nullValue)

This is caused by the fact that in case of using identical id, the internal counters are used by both functions instead of single counter for each function call.

Detection of processing row order changes

As I  have mentioned at the beginning fo the article, this update was done upon Paul White comments, so now the function detect such issues if they occurs. Let’s simulate such issue.

SELECT
    RowID,
    Value,
    [ClrUnsafe].fn_RunningTotalBigInt(Value, 0, ROW_NUMBER() OVER(ORDER BY RowID), null) as RunningTotal,
    ROW_NUMBER() OVER(ORDER BY RowID DESC) DescendingRowOrder
FROM dbo.SmallTable
ORDER BY RowID
Msg 6522, Level 16, State 2, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "fn_RunningTotalInt":
System.InvalidOperationException: Rows were processed out of expected order. Expected RowNo: 1, received RowNo: 1000
System.InvalidOperationException:
in RunningTotals.RunningTotalInt(SqlInt32 val, SqlByte id, Int32 rowNo, SqlInt32 nullValue)

This query produces below plan:

The issue is, that the the scalar operator is after a sort in descending order which is being done for the DescendingRowOrder and prior the final sort. However if such issue occurs, it is correctly detected and an exception raised. This issue can occur even the scalar operator was the last operator as the order of rows is not guaranteed by the query processor until final result is received, but in normal cases in single threaded processing it generally works and in case of some issues, we have the check implemented.

Null values handling

Now we will focus on the NULL values handling. Let’s run another simple query with some NULL values.

WITH Data (RowID, Value) AS (
    SELECT 1 AS RowId, NULL As Value
    UNION ALL SELECT 2 AS RowId, 1 As Value
    UNION ALL SELECT 3 AS RowId, NULL As Value
    UNION ALL SELECT 4 AS RowId, 1 As Value UNION ALL
    SELECT 5 AS RowId, 1 As Value
)
SELECT
    RowId
    ,Value
    ,[dbo].fn_RunningTotalBigInt(Value, 0, ROW_NUMBER() OVER(ORDER BY RowID), null) as TotalNull
    ,[dbo].fn_RunningTotalBigInt(Value, 1, ROW_NUMBER() OVER(ORDER BY RowID), 0) as TotalZero
    ,[dbo].fn_RunningTotalBigInt(Value, 2, ROW_NUMBER() OVER(ORDER BY RowID), 10) as TotalTen
FROM Data
ORDER BY RowID

And of course result of such solution:

RowId       Value       TotalNull            TotalZero            TotalTen
----------- ----------- -------------------- -------------------- --------------------
1           NULL        NULL                 0                    10
2           1           1                    1                    11
3           NULL        1                    1                    21
4           1           2                    2                    22
5           1           3                    3                    23

Here we can see, that if we pass NULL, then the NULL values are ignored and in the other two cases the NULL values are handled as 0 respectively 10.

Parallel plans and not ordered result set testing

Till now we had tested only simple queries, which produce serial plan. But what if we will use the function in much more complicated queries, or queries processing high amount of data and which produce parallel query plan on multi processor (core) systems.

Let’s simulate such scenario on our big table, which should produce a parallel plan on multi core system and default settings of SQL Server.

SELECT
    RowID
    ,Value
    ,[dbo].fn_RunningTotalBigInt(Value, 0, ROW_NUMBER() OVER(ORDER BY RowID), null) as RunningTotal
FROM dbo.BigTable
ORDER BY RowID

Here is query plan for above query:

And here results:

RowID       Value       RunningTotal
----------- ----------- --------------------
1           1           1
2           2           3
3           3           6
4           4           10
99994       99994       4999450015
99995       99995       4999550010
99996       99996       4999650006
99997       99997       4999750003
99998       99998       4999850001
99999       99999       4999950000
100000      100000      5000050000

(100000 row(s) affected)

In this case the results are correct as Compute Scalar was performed after the parallel streams were garthered.
In case the computation of scalar running total value occured prior the parallel streams are garthered, then again an issue that rows are not processed in corect order may occur.
To avoid this kind of problems we can force serial query plan by setting max degree of parallelism for the query using the OPTION(MAXDOP 1)

SELECT
    RowID
    ,Value
    ,[dbo].fn_RunningTotalBigInt(Value, 0, ROW_NUMBER() OVER(ORDER BY RowID), null) as RunningTotal
FROM dbo.BigTable
ORDER BY RowID
OPTION(MAXDOP 1)

This query will always produce a serial plan. In this case following:

Final form of Running Totals

From the above we can construct a final form of the Running Totals functions for each numeric data type. The final form can look like following class;

using System;
using Microsoft.SqlServer.Server;
using System.Runtime.Remoting.Messaging;
using System.Data.SqlTypes;
using System.Collections.Generic;

/// <summary>
/// Class contains CLR scalar functions for calculation of running totals
/// </summary>
public class RunningTotals
{
    /// <summary>
    /// Storage Structure for holding actual Total and row number for security check.
    /// </summary>
    /// <typeparam name="T">Totals Data Type</typeparam>
    private struct RtStorage<T> where T : struct
    {
        public T Total;
        public int RowNo;
    }

    /// <summary>
    /// Calculates a running totals on TinyInt (byte) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlByte representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlByte RunningTotalTinyInt(SqlByte val, SqlByte id, int rowNo, SqlByte nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlByte>)lastSum : new RtStorage<SqlByte>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }

    /// <summary>
    /// Calculates a running totals on SmallInt (Int) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlInt16 representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlInt16 RunningTotalSmallInt(SqlInt16 val, SqlByte id, int rowNo, SqlInt16 nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlInt16>)lastSum : new RtStorage<SqlInt16>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }

    /// <summary>
    /// Calculates a running totals on Int (Int32) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlInt32 representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlInt32 RunningTotalInt(SqlInt32 val, SqlByte id, int rowNo, SqlInt32 nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlInt32>)lastSum : new RtStorage<SqlInt32>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }

    /// <summary>
    /// Calculates a running totals on BigInt (Int64) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlInt64 representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlInt64 RunningTotalBigInt(SqlInt64 val, SqlByte id, int rowNo, SqlInt64 nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlInt64>)lastSum : new RtStorage<SqlInt64>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }

    /// <summary>
    /// Calculates a running totals on Float (Double) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlDouble representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlDouble RunningTotalFloat(SqlDouble val, SqlByte id, int rowNo, SqlDouble nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlDouble>)lastSum : new RtStorage<SqlDouble>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }

    /// <summary>
    /// Calculates a running totals on Real (Single) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlSingle representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlSingle RunningTotalReal(SqlSingle val, SqlByte id, int rowNo, SqlSingle nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlSingle>)lastSum : new RtStorage<SqlSingle>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }

    /// <summary>
    /// Calculates a running totals on Money data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlMoney representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlMoney RunningTotalMoney(SqlMoney val, SqlByte id, int rowNo, SqlMoney nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlMoney>)lastSum : new RtStorage<SqlMoney>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }

    /// <summary>
    /// Calculates a running totals on Decimal data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="rowNo">Specifies expecter rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <returns>SqlDecimal representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlDecimal RunningTotalDecimal(SqlDecimal val, SqlByte id, int rowNo, SqlDecimal nullValue)
    {
        string dataName = string.Format("MultiSqlRt_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        var storage = lastSum != null ? (RtStorage<SqlDecimal>)lastSum : new RtStorage<SqlDecimal>();
        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        CallContext.SetData(dataName, storage);

        return storage.Total;
    }
}

Conclusion

As we have tested and seen, the CLR version of calculation running totals produces very simple query plans and is one of the fastest methods to calculate running totals.

The CLR functions are really simple and fast.

On one side the CLR solution for running totals is one of the fastest, but on the other side we have to remember that this solution requires, that records are processed in order in which the running totals should be calculated. However the Query Optimizer and Query Processor doesn’t guarantee, that the records will be processed in particular order. It only guarantees, that the final output will be sorced according the ORDER BY clause.

To ensure, that those function to not produce incorrect results a security check in the form of rowNo generated by ROW_NUMBER() function needs to be part of the solution an in case the solution could produce incorrect results, it raises an exception.

So I would like to thank gain to Paul White MVP (blog) for his notes.

Remarks

You can download a complete Visual C# project + SQL Scripts (script for generating sample database, script for creating CLR objects + script with test queries) from my SkyDrive here: SqlClrTotals_Updated.zip

SQL Server – String splitting (T-SQL vs. CLR)

This blog post is inspired by forum thread on SQL Server Central Convert CSV values in three columns to rows.

On many sites there are comparison of string splitting functions using pure T-SQL. But not much of them also mention the CLR way of doing it.

On small strings with a few elements the T-SQL solution is unbeatable in speed. But when the number of elements in the string increase, the situation is completely different.

In this comparison I will use one of the quickest T-SQL solution vs. a CLR solution using regular expressions.

For the fastest CLR solution, check my blog post: Fastest CSV strings splitting using CLR (T-SQL vs. CLR revisited).

Probably the quickest known T-SQL solution is function mentioned by Jeff Moden in the forum thread http://www.sqlservercentral.com/Forums/FindPost944589.aspx.

CREATE FUNCTION [dbo].[DelimitedSplit8K]
--/***************************************************************************************************
-- Purpose:
-- Split a given string at a given delimiter and return a list of the split elements (items).

-- Returns:
-- iTVF containing the following:
-- ItemNumber = Element position of Item as a BIGINT (not converted to INT to eliminate a CAST)
-- Item       = Element value as a VARCHAR(8000)

-- CROSS APPLY Usage Example:
-----------------------------------------------------------------------------------------------------
----===== Conditionally drop the test tables to make reruns easier for testing.
--     -- (this is NOT a part of the solution)
--     IF OBJECT_ID('tempdb..#JBMTest') IS NOT NULL
--        DROP TABLE #JBMTest
--;
----===== Create and populate a test table on the fly (this is NOT a part of the solution).
-- SELECT *
--   INTO #JBMTest
--   FROM (
--         SELECT 1,'1,10,100,1000,10000,100000,1000000' UNION ALL
--         SELECT 2,'2000000,200000,20000,2000,200,20,2' UNION ALL
--         SELECT 3, 'This,is,a,test'                    UNION ALL
--         SELECT 4, 'and so is this'                    UNION ALL
--         SELECT 5, 'This, too (no pun intended)'
--        ) d (SomeID,SomeValue)
--;
--GO
----===== Split the CSV column for the whole table using CROSS APPLY (this is the solution)
-- SELECT test.SomeID, split.ItemNumber, split.Item
--   FROM #JBMTest test
--  CROSS APPLY
--        (
--         SELECT ItemNumber, Item
--           FROM dbo.DelimitedSplit8k(test.SomeValue,',')
--        ) split
--;
-----------------------------------------------------------------------------------------------------
-- Notes:
-- 1. Optimized for VARCHAR(7999) or less.  No testing or error reporting for truncation at 7999
--    characters is done.
-- 2. Optimized for single character delimiter.  Multi-character delimiters should be resolved
--    externally from this function.
-- 3. Optimized for use with CROSS APPLY.
-- 4. Does not "trim" elements just in case leading or trailing blanks are intended.
-- 5. If you don't know how a Tally table can be used to replace loops, please see the following...

--http://www.sqlservercentral.com/articles/T-SQL/62867/

-- 6. Changing this function to use VARCHAR(MAX) will cause it to run twice as slow.  It's just the
--    nature of VARCHAR(MAX) whether it fits in-row or not.
-- 7. Multi-machine testing for the method of using UNPIVOT instead of 10 SELECT/UNION ALLs shows
--    that the UNPIVOT method is quite machine dependent and can slow things down quite a bit.
-- 8. Performance testing shows using "TOP" for the limiting criteria of "N" is actually
--    slower and slightly more CPU intensive than the traditional WHERE N < LEN(@pString) + 2.
-- 9. Performance testing shows using ORDER BY (SELECT x) where "x" is anything is actually
--    slower and slightly more CPU intensive than the traditional ORDER BY (SELECT N).

-- Credits:
-- This code is the product of many people's efforts including but not limited to the following:
-- cteTally concept originally by Iztek Ben Gan and "decimalized" by Lynn Pettis (and others) for a
-- bit of extra speed and finally redacted by Jeff Moden for a different slant on readability and
-- compactness. Hat's off to Paul White for his simple explanations of CROSS APPLY. Finally,
-- special thanks to Erland Sommarskog for his tireless efforts to help people understand
-- what you can actually do with T-SQL. I also thank whoever wrote the first article I ever saw
-- on "numbers tables" which is located at the following URL ...

--http://sqlserver2000.databases.aspfaq.com/why-should-i-consider-using-an-auxiliary-numbers-table.html

-- Revision History:
-- Rev 00 - 20 Jan 2010 - Concept: Lynn Pettis and others.
--                        Redaction/Implementation: Jeff Moden
--        - Base 10 redaction and reduction for CTE.  (Total rewrite)

-- Rev 01 - 13 Mar 2010 - Jeff Moden
--        - Removed one additional concatenation and one subtraction from the SUBSTRING in the
--          SELECT List for that tiny bit of extra speed.

-- Rev 02 - 14 Apr 2010 - Jeff Moden
--        - No code changes.  Added CROSS APPLY usage example to the header, some additional credits,
--          and extra documentation.

-- Rev 03 - 18 Apr 2010 - Jeff Moden
--        - No code changes.  Added notes 7, 8, and 9 about certain "optimizations" that dont'
--          actually work for this type of function.

-- Rev 04 - 29 Jun 2010 - Jeff Moden
--        - Added WITH SCHEMABINDING thanks to a note by Paul White.  This prevents an unnecessary
--          "Table Spool" when the function is used in an UPDATE statement even though the function
--          makes no external references.
--***************************************************************************************************/
--===== Define I/O parameters
        (
        @pString    VARCHAR(7999),
        @pDelimiter CHAR(1)
        )
RETURNS TABLE
   WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Table” produces values up to
     -- 10,000... enough to cover VARCHAR(8000)
WITH
      E1(N) AS ( --=== Create Ten 1's
                 SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 --10
               ),
      E2(N) AS (SELECT 1 FROM E1 a, E1 b),   --100
      E4(N) AS (SELECT 1 FROM E2 a, E2 b),   --10,000
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT N)) FROM E4)
--===== Do the split
 SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemNumber,
        SUBSTRING(@pString, N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, N) - N) AS Item
   FROM cteTally
  WHERE N < LEN(@pString) + 2
    AND SUBSTRING(@pDelimiter + @pString, N, 1) = @pDelimiter
;
GO

For the CLR we will use the RegEx function I mentioned in one of my previous blog post: SQL Server 2005 and SQL Server 2008 Regular Expressions and pattern matching.

To make the comparison more interesting I will use the table structure as Jeff Moden has used in his post http://www.sqlservercentral.com/Forums/FindPost988659.aspx. You can also took on the comparison of different T-SQL solutions made by Jeff in the mentioned thread.

Test data preparation

We will build 3 test data table. One for string with 16 element (CsvTest), one for 100 element (CSVTest3) and one for 1333 elements (CSVTest2). The 1333 elements is near the varchar(8000) for which the T-SQL version of DelimitedSplit8K is optimized.

WITH TallyBase AS (
    SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL
    SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N
),
Tally AS (
    SELECT
        CAST(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS INT) as N
    FROM TallyBase t1, TallyBase t2, TallyBase T3, TallyBase T4
)
SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get`s rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (16) --Controls the number of CSV elements in each row
                    ',' + CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS NVARCHAR(MAX))
                ) AS CsvParameter
INTO CsvTest
FROM Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN Tally t2;  --can produce row sets up 121 million.
GO

ALTER TABLE dbo.CsvTest ADD PRIMARY KEY (RowNum) WITH FILLFACTOR=100
GO

WITH TallyBase AS (
    SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL
    SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N
),
Tally AS (
    SELECT
        CAST(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS INT) as N
    FROM TallyBase t1, TallyBase t2, TallyBase T3, TallyBase T4
)
SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (100) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS NVARCHAR(MAX))
                ) AS CsvParameter
INTO CsvTest3
FROM Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN Tally t2;  --can produce row sets up 121 million.
GO

ALTER TABLE dbo.CsvTest3 ADD PRIMARY KEY (RowNum) WITH FILLFACTOR=100
GO

WITH TallyBase AS (
    SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL
    SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N
),
Tally AS (
    SELECT
        CAST(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS INT) as N
    FROM TallyBase t1, TallyBase t2, TallyBase T3, TallyBase T4
)
SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (1333) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS NVARCHAR(MAX))
                ) AS CsvParameter
INTO CsvTest2
FROM Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN Tally t2;  --can produce row sets up 121 million.
GO

ALTER TABLE dbo.CsvTest2 ADD PRIMARY KEY (RowNum) WITH FILLFACTOR=100
GO

Speed testing and comparison

The speed comparison Tests will be done using the below queries

--T-SQL version
DBCC DROPCLEANBUFFERS
GO
DBCC FREEPROCCACHE
GO

SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
--===  T-SQL Version  ===
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.Item
FROM dbo.CsvTest D
CROSS APPLY dbo.DelimitedSplit8K(D.CsvParameter, ',') V
GO

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO

--CLR Version
DBCC DROPCLEANBUFFERS
GO
DBCC FREEPROCCACHE
GO

SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
--===  CLR-RegEx Version  ===
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.matchId,
    @ItemValue = CAST(V.value as int)
FROM dbo.CsvTest D
CROSS APPLY dbo.fn_RegExMatches(D.CsvParameter, '(d+?)(?:,|$)') V
WHERE V.groupId = 1 --Return only GroupID=1 as this group contains the value GroupID=0 contains whole RegExMatch
GO

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO

In the CLR RegEx version is used following regular expression : ‘(d+?)(?:,|$)’. It could be also possible to use ‘d+’ in this situation.

Results for 16 elements in string

--===  T-SQL Version  ===
SQL Server parse and compile time:
   CPU time = 15 ms, elapsed time = 26 ms.
Table 'CsvTest'. Scan count 1, logical reads 262, physical reads 2, read-ahead reads 281, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 1841 ms,  elapsed time = 1922 ms.
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 1 ms.

--===  CLR-RegEx Version  ===
SQL Server parse and compile time:
   CPU time = 16 ms, elapsed time = 22 ms.
Table 'CsvTest'. Scan count 5, logical reads 781, physical reads 5, read-ahead reads 281, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 77173 ms,  elapsed time = 25480 ms.
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.

and SQL Profiler output:

16 elements string splitting profiler output

Here we can see, that the T-SQL version is unbeatable. The T-SQL took about 2 seconds, while the CLR-RegEx took more than 25 seconds. We can also notice, that the T-SQL version had nearly 3 times less reads.

Results for 100 elements in string

--=== T-SQL Version ===
SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 27 ms.
Table 'CsvTest3'. Scan count 1, logical reads 1675, physical reads 4, read-ahead reads 1677, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
    CPU time = 19890 ms, elapsed time = 20019 ms.
SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 0 ms. 

--=== CLR-RegEx Version ===
SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 18 ms.
Table 'CsvTest3'. Scan count 5, logical reads 1836, physical reads 5, read-ahead reads 1677, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
    CPU time = 79639 ms, elapsed time = 26602 ms.
SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 1 ms.

and SQL Profiler output:

100 elements string splitting profiler output

Here we can see that the duration for T-SQL version rapidly increased to 20 seconds, while the duration of CRL-RegEx is nearly the same as for 16 elements in string not a whole 26 and half seconds. Also Reds are nearly the same.

Results for 1333 elements in string

--=== T-SQL Version ===
SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 25 ms.
Table 'CsvTest2'. Scan count 1, logical reads 74, physical reads 2, read-ahead reads 72, lob logical reads 367392280, lob physical reads 2500, lob read-ahead reads 0. 

SQL Server Execution Times:
    CPU time = 2745696 ms, elapsed time = 2758753 ms.
SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 0 ms. 

--=== CLR-RegEx Version ===
SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 43 ms.
Table 'CsvTest2'. Scan count 5, logical reads 217, physical reads 5, read-ahead reads 72, lob logical reads 320860, lob physical reads 2494, lob read-ahead reads 170000.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
    CPU time = 170228 ms, elapsed time = 59624 ms.
SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 0 ms.

And SQL Profiler output:

1333 elements string splitting profiler output

From this results we can see, that when the length of the string increased and number of elements increased to 1333, the T-SQL version is nearly unusable and the CLR-RegEx is unbeatable by the T-SQL. Here the T-SQL solution took nearly 46 minutes while the CLR-RegEx took only 1 minute. Also the T-SQL solution now has 1 144 times more reads.

CLR-RegEx Testing of very long strings with hi elements count

Here are some test of the CLR-RegEx solution on some very large strings. This test splits single very long string using the CLR-RegEx. The number of elements in the CSV controls the TOP clause in below testing query. We will use it for several scenarios.

DECLARE @csv varchar(max);
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;

WITH TallyBase AS (
    SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL
    SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N UNION ALL SELECT 1 N
),
Tally AS (
    SELECT
        CAST(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS INT) as N
    FROM TallyBase t1, TallyBase t2, TallyBase T3, TallyBase T4
)
SELECT @csv = CAST(STUFF( --=== STUFF get`s rid of the leading comma
        ( --=== This builds CSV row with a leading comma
        SELECT TOP (100000) --Controls the number of CSV elements in each row
            ',' + CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
        FROM Tally t3      --Classic cross join pseudo-cursor
        CROSS JOIN Tally t4 --can produce row sets up 121 million.
        FOR XML PATH('')
        )
        ,1,1,'') AS NVARCHAR(MAX))

SET STATISTICS TIME ON

SELECT
    @ItemNumber = V.matchId,
    @ItemValue = CAST(V.value as int)
FROM dbo.fn_RegExMatches(@csv, '(d+?)(?:,|$)') V
WHERE V.groupId = 1 --Return only GroupID=1 as this group contains the value GroupID=0 contains whole RegExMatch
GO

SET STATISTICS TIME OFF

100 000 elements

SQL Server Execution Times:

CPU time = 484 ms,  elapsed time = 496 ms.

SQL Server parse and compile time:

CPU time = 0 ms, elapsed time = 0 ms.

100 000 elements string splitting profiler output

1 000 000 elements

SQL Server Execution Times:

CPU time = 4352 ms,  elapsed time = 20713 ms.

SQL Server parse and compile time:

CPU time = 0 ms, elapsed time = 0 ms.

1 000 000 elements string splitting profiler output

10 000 000 elements

SQL Server Execution Times:

CPU time = 44320 ms,  elapsed time = 88735 ms.

SQL Server parse and compile time:

CPU time = 0 ms, elapsed time = 0 ms.

10 000 000 elements string splitting profiler output

Conclusion

From the test and measuring done in this blog post we can find, that when we need to split relatively small strings with small elements count (<100 elements) then the T-SQL Solution is the quickest possible and unbeatable.

When we have a moderate count of elements in the string (about 100) then it does not matter if we use the T-SQL or CLR-RegEx solution.

Once the count of elements in the string increases (more than 100) then the CLR-RegEx starts being unbeatable and with high counts of elements (more than 1000, or even millions of elements) then the T-SQL solution is nearly unusable as we can see the results of 10 000 rows with 1333 elements each – the T-SQL version took 46 minutes, while the CLR-RegEx took only 1 minute.

Also during all the tests, there was no occurrence of AppDomain Restart/(Unload/Load) when manipulating large strings using the CLR-RegEx solution. On some blogs I have found, that there was some AppDomain unloading and loading when using CLR but with combination of String.Split method. The RegEx solution doesn’t have such problems.

SQL Server 2005 and SQL Server 2008 Regular Expressions and pattern matching

To write this blog post I was inspired by a question on Ask SQL Server Central.com about finding a substring between two characters.

Some times it is necessary to process input string and retrieve some parts of them. For this purposes we can write some more complex T-SQL constructs or use the CLR functionality of SQL Server 2005 and SQL Server 2008 and write some functions for processing Regular Expressions.

Generally all the string manipulations (especially on very large string values) are much better performing using a CLR scalar and table-valued functions than their equivalent using pure T-SQL code. Also using regular expression you do not need to write specialized functions for particular processing as you simply write a regular expression and use it in the RegEx function.

For general info about writing a CLR functions read MSDN.

More info on regular expressions you can find on Regular-Expressions.info and you can find inspiration also in the Regular Expression Library.

Implementation

For such use you can find a below SQLRegEx class which contains two functions – RegExMatch with is a scalar CLR function to return particular group from within particular match and CLR table-valued function RegExMatches which return all matches and groups.

Below is a code snippet you can use to create an assembly.

public class SQLRegEx
{
    private class RegExRow
    {
        /// <summary>
        /// Private class for passing matches of the RegExMatches to the FillRow method
        /// </summary>
        /// <param name=”rowId”>ID of the Row</param>
        /// <param name=”matchId”>ID of the Match</param>
        /// <param name=”groupID”>ID of the Group within the Match</param>
        /// <param name=”value”>Value of the particular group</param>
        public RegExRow(int rowId, int matchId, int groupID, string value)
        {
            RowId = rowId;
            MatchId = matchId;
            GroupID = groupID;
            Value = value;
        }

        public int RowId;
        public int MatchId;
        public int GroupID;
        public string Value;
    }

    /// <summary>
    /// Applies Regular Expression on the Source string and returns value of particular group from withing a specified match
    /// </summary>
    /// <param name=”sourceString”>Source string on which the regular expression should be applied</param>
    /// <param name=”pattern”>Regular Expression pattern</param>
    /// <param name=”matchId”>ID of the Match to be returned 1 inex-based</param>
    /// <param name=”groupId”>ID of the group from within a match to return. GroupID 0 returns complete match</param>
    /// <returns>Value of the Group from within a Match</returns>
    [SqlFunction(IsDeterministic=true)]
    public static SqlChars RegExMatch(string sourceString, string pattern, int matchId, int groupId)
    {
        Match m = null;
        Regex r = new Regex(pattern, RegexOptions.Compiled);

        if (matchId == 1)
        {
            m = r.Match(sourceString);
        }
        else if (matchId > 1)
        {
            MatchCollection mc = r.Matches(sourceString);
            m = mc != null && mc.Count > matchId – 1 ? mc[matchId - 1] : null;
        }

        return m != null && m.Groups.Count > groupId ? new SqlChars(m.Groups[groupId].Value) : SqlChars.Null;
    }

    /// <summary>
    /// Applies Regular Expression o the Source strings and return all matches and groups
    /// </summary>
    /// <param name=”sourceString”>Source string on which the regular expression should be applied</param>
    /// <param name=”pattern”>Regular Expression pattern</param>
    /// <returns>Returns list of RegExRows representing the group value</returns>
    [SqlFunction(FillRowMethodName = "FillRegExRow")]
    public static IEnumerable RegExMatches(string sourceString, string pattern)
    {
        Regex r = new Regex(pattern, RegexOptions.Compiled);
        int rowId = 0;
        int matchId = 0;
        foreach (Match m in r.Matches(sourceString))
        {
            matchId++;
            for (int i = 0; i < m.Groups.Count; i++)
            {
                yield return new RegExRow(++rowId, matchId, i, m.Groups[i].Value);
            }
        }
    }

    /// <summary>
    /// FillRow method to populate the output table
    /// </summary>
    /// <param name=”obj”>RegExRow passed as object</param>
    /// <param name=”rowId”>ID or the returned row</param>
    /// <param name=”matchId”>ID of returned Match</param>
    /// <param name=”groupID”>ID of group in the Match</param>
    /// <param name=”value”>Value of the Group</param>
    public static void FillRegExRow(Object obj, out int rowId, out int matchId, out int groupID, out SqlChars value)
    {
        RegExRow r = (RegExRow)obj;
        rowId = r.RowId;
        matchId = r.MatchId;
        groupID = r.GroupID;
        value = new SqlChars(r.Value);
    }
}

RegExMatch function

This is a scalar function which takes three arguments. First is the source string to be parsed using regular expressions. Second parameter is regular expression itself, third is a matchId (counted from 1 and which represents a match to be returned) to be returned if there could be more matches and fourth is a Group ID to be returned from within the regular expression match. When there is no group or you want to return simply the complete Match, pass 0.

RegExMatches function

This is CLR table function and it returns all matches including each group in each match. The complete Match has groupID equal to 0 in the output. Input parameters are source string to be parsed using the regular expression a second is the regular expression pattern.

Once we have a compiled assembly, we can register the assembly and function in database and start using them for various Regular Expressions processing.

CREATE ASSEMBLY [SQLRegEx]
AUTHORIZATION [dbo]
FROM 'C:CLRSQLRegEx.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE FUNCTION [dbo].[fn_RegExMatch](
    @sourceString nvarchar(max), --Source string to be processed by regular expression
    @pattern nvarchar(4000), --Regular expression (pattern) to be applied on the source string
    @matchId int, --ID of the match to be returned (if there are more matches, you can return a concrete one). Numbering starts from 1
    @groupID int --ID of the group in the match to be returned. GroupID = 0 represents a complete match.
)
RETURNS nvarchar(4000)
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatch]
GO

CREATE FUNCTION [dbo].[fn_RegExMatches](
    @sourceString nvarchar(max), --Source string to be processed by regular expression
    @pattern nvarchar(4000) --Regular expression (pattern) to be applied on the source string
)
RETURNS TABLE (
    [rowId] int, --RowId each row as it`s ID
    [matchId] int, --ID of particular match (starts from 1)
    [groupId] int, --ID of particular group in RegEx match (GroupID = 0) represents a complete match
    [value] nvarchar(4000) --value of the group
) WITH EXECUTE AS CALLER
EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatches]
GO

RegExMatches expamples

Eg. Split semicolon separated list of Key=Value pairs. Pair has the same matchId and Key has groupID = 1 and value has groupID=2. Without the WHERE Condition also the GroupID 0 will be returned which contains each pair.

Split Key=Value pairs
SELECT
*
FROM dbo.fn_RegExMatches('Key1=Val1;Key2=Val2;Key3=Val3;Key4=Val4', 's*(w+)s*=s*?(.*?)(?=;|z)')
WHERE groupId > 0
rowId       matchId     groupId     value
———– ———– ———– ——
2           1           1           Key1
3           1           2           Val1
5           2           1           Key2
6           2           2           Val2
8           3           1           Key3
9           3           2           Val3
11          4           1           Key4
Splitting comma separated list
SELECT
    *
FROM dbo.fn_RegExMatches('1,2,3,4,5,6,7,8', '(d+?)(?:,|z)')
WHERE groupId = 1

--or even this and doesn’t matter what separator is used
SELECT
    *
FROM dbo.fn_RegExMatches('1,2,3,4,5;6,7,8', 'w+')
List of words in sentence
SELECT
    *
FROM dbo.fn_RegExMatches('This is a sample sentence to Split.', 'w+')

RegExMatch examples

Value of the Key=Value pair
SELECT dbo.fn_RegExMatch('Key = Value' ,'s*w+s*=s*(.*)', 1, 1)
Fourth word in the string
SELECT dbo.fn_RegExMatch('This is a sample sentence to Split.', 'w+', 4, 0)
Sixth number in list
SELECT dbo.fn_RegExMatch('1,2,3,4,5;6,7,8', 'w+', 6, 0)
Email address validation
--validating email address
DECLARE @emailAddress nvarchar(20)
SET @emailAddress = 'name@domain.com'
SELECT
    CASE
        WHEN
            @emailAddress =
            dbo.fn_RegExMatch(@emailAddress, '([a-zA-Z0-9_-.]+)@(([[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.)|(([a-zA-Z0-9-]+.)+))([a-zA-Z]{2,4}|[0-9]{1,3})', 1, 0)
        THEN 1
        ELSE 0
    END AS ValidEmail

--lists all invalid email
WITH emails AS (
    SELECT 'name@domain.com' AS email UNION ALL
    SELECT 'name@subdomain.domain.com' UNION All
    SELECT 'name@this.email.is.wrong' UNION ALL
    SELECT 'another.wrong@@domain.com'
)
SELECT
    *
FROM emails
WHERE email <> dbo.fn_RegExMatch(email, '([a-zA-Z0-9_-.]+)@(([[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.)|(([a-zA-Z0-9-]+.)+))([a-zA-Z]{2,4}|[0-9]{1,3})', 1, 0)
Get day from string date
WITH StringDates AS (
    SELECT '2010/09/3' AS StringDate UNION ALL
    SELECT '2010/07/09' UNION ALL
    SELECT '2009/1/15'
)
SELECT
    StringDate,
    dbo.fn_RegExMatch(StringDate, '(d{4})/(d{1,2})/(d{1,2})', 1, 3) AS DateDay
FROM StringDates

Summary

Hope, this article will help you using regular expressions in SQL Server and you will be able to simplify and speed-up your queries.

You may be also interested in my next post SQL Server Regular Expressions – Replacements which describes possibilities of string replacements using regular expressions.

Converting AS400 EBCDIC 870 into Win1250 code page on SQL Server 2005

I have to read data from AS400 machine using an ODBC driver under SQL Server 2005.  The data on the AS400 machine are encoded using EBCDIC 870 – Multilingual/ROECE (Latin-2) code page and contains Czech diacritics.

Unfortunately,SQL Server 2005 have no native support for EBCDIC 870 code page and when you read the data, you receive wrong characters.

There are some EBCDIC character sets supported by SQL server, but not the 870 one.

select 
    * 
from fn_helpcollations()

I was googling and binging for some time, but didn’t find any suitable solution, if you need to select live data from the AS400 machine.

Finally I was looking deeper into to the problem and found an acceptable work around.

The ODBC driver and AS400 support HEX() function, which returns sequence of hexadecimal codes representing the value.

SELECT 
    * 
FROM OPENQUERY(AS400, 'SELECT HEX(FieldName) AS FieldName FROM AS400Table')
Once we have an hexadecimal sequence, we can handle it on the SQL Server side in a way we want.

You can crate an CLR function which will allow you to convert the source CBCDIC 870 code page to the destination code page of you choice.

[SqlFunction()]
public static SqlString fnEBCDIC870ToSql(SqlString hexBytes, int destinationCodePage)
{
    if (hexBytes.IsNull) return hexBytes;
    string str = hexBytes.ToString();
    byte[] buffer = new byte[str.Length / 2];

    Encoding destinationEncoding = Encoding.GetEncoding(destinationCodePage);
    for (int i = 0; i < buffer.Length; i++)
    {
        buffer[i] = byte.Parse(str.Substring(i * 2, 2), NumberStyles.HexNumber);
    } 

    buffer = Encoding.Convert(Encoding.GetEncoding(870), destinationEncoding, buffer);
    str = destinationEncoding.GetString(buffer);
    return new SqlString(str);
}

If you cannot use the CLR function on the SQL server or Database, you can simply create an Conversion table and conversion scalar function for this conversion.

for conversion to code page 1250:

CREATE FUNCTION [dbo].[fnEBCDIC870ToWin1250] ( 
    @pHex varchar(8000) 
) 
RETURNS varchar(4000) 
AS 
BEGIN 
    DECLARE @result varchar(4000) 

    SET @result = '';

    WITH Tally AS (
        SELECT
            ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) As RowNo
        FROM sys.all_columns
    )
    SELECT
        @result = @result + ISNULL(C.Character, '?') 
    FROM Tally
    LEFT JOIN CodePage870 C ON UPPER(SUBSTRING(@pHex, (RowNo - 1) * 2 + 1, 2)) = C.HexByte
    WHERE RowNo <= LEN(@pHex) / 2
    ORDER BY RowNo

    RETURN @result
END

The function uses a conversion table CodePage870.

CREATE TABLE CodePage870 (
    HexByte char(2) NOT NULL,
    Character char(1) NOT NULL,
    CONSTRAINT PK_CodePage870 PRIMARY KEY (HexByte)
)
INSERT CodePage870 (
    HexByte,
    Character
)
SELECT '40' As HexByte, ' ' As Character UNION ALL
SELECT '41' As HexByte, ' ' As Character UNION ALL
SELECT '42' As HexByte, 'â' As Character UNION ALL
SELECT '43' As HexByte, 'ä' As Character UNION ALL
SELECT '44' As HexByte, 'ţ' As Character UNION ALL
SELECT '45' As HexByte, 'á' As Character UNION ALL
SELECT '46' As HexByte, 'ă' As Character UNION ALL
SELECT '47' As HexByte, 'č' As Character UNION ALL
SELECT '48' As HexByte, 'ç' As Character UNION ALL
SELECT '49' As HexByte, 'ć' As Character UNION ALL
SELECT '4A' As HexByte, '[' As Character UNION ALL
SELECT '4B' As HexByte, '.' As Character UNION ALL
SELECT '4C' As HexByte, '<' As Character UNION ALL
SELECT '4D' As HexByte, '(' As Character UNION ALL
SELECT '4E' As HexByte, '+' As Character UNION ALL
SELECT '4F' As HexByte, '!' As Character UNION ALL
SELECT '50' As HexByte, '&' As Character UNION ALL
SELECT '51' As HexByte, 'é' As Character UNION ALL
SELECT '52' As HexByte, 'ę' As Character UNION ALL
SELECT '53' As HexByte, 'ë' As Character UNION ALL
SELECT '54' As HexByte, 'ů' As Character UNION ALL
SELECT '55' As HexByte, 'í' As Character UNION ALL
SELECT '56' As HexByte, 'î' As Character UNION ALL
SELECT '57' As HexByte, 'ľ' As Character UNION ALL
SELECT '58' As HexByte, 'ĺ' As Character UNION ALL
SELECT '59' As HexByte, 'ß' As Character UNION ALL
SELECT '5A' As HexByte, ']' As Character UNION ALL
SELECT '5B' As HexByte, '$' As Character UNION ALL
SELECT '5C' As HexByte, '*' As Character UNION ALL
SELECT '5D' As HexByte, ')' As Character UNION ALL
SELECT '5E' As HexByte, ';' As Character UNION ALL
SELECT '5F' As HexByte, '^' As Character UNION ALL
SELECT '60' As HexByte, '-' As Character UNION ALL
SELECT '61' As HexByte, '/' As Character UNION ALL
SELECT '62' As HexByte, 'Â' As Character UNION ALL
SELECT '63' As HexByte, 'Ä' As Character UNION ALL
SELECT '64' As HexByte, '˝' As Character UNION ALL
SELECT '65' As HexByte, 'Á' As Character UNION ALL
SELECT '66' As HexByte, 'Ă' As Character UNION ALL
SELECT '67' As HexByte, 'Č' As Character UNION ALL
SELECT '68' As HexByte, 'Ç' As Character UNION ALL
SELECT '69' As HexByte, 'Ć' As Character UNION ALL
SELECT '6A' As HexByte, '|' As Character UNION ALL
SELECT '6B' As HexByte, ',' As Character UNION ALL
SELECT '6C' As HexByte, '%' As Character UNION ALL
SELECT '6D' As HexByte, '_' As Character UNION ALL
SELECT '6E' As HexByte, '>' As Character UNION ALL
SELECT '6F' As HexByte, '?' As Character UNION ALL
SELECT '70' As HexByte, 'ˇ' As Character UNION ALL
SELECT '71' As HexByte, 'É' As Character UNION ALL
SELECT '72' As HexByte, 'Ę' As Character UNION ALL
SELECT '73' As HexByte, 'Ë' As Character UNION ALL
SELECT '74' As HexByte, 'Ů' As Character UNION ALL
SELECT '75' As HexByte, 'Í' As Character UNION ALL
SELECT '76' As HexByte, 'Î' As Character UNION ALL
SELECT '77' As HexByte, 'Ľ' As Character UNION ALL
SELECT '78' As HexByte, 'Ĺ' As Character UNION ALL
SELECT '79' As HexByte, '`' As Character UNION ALL
SELECT '7A' As HexByte, ':' As Character UNION ALL
SELECT '7B' As HexByte, '#' As Character UNION ALL
SELECT '7C' As HexByte, '@' As Character UNION ALL
SELECT '7D' As HexByte, '''' As Character UNION ALL
SELECT '7E' As HexByte, '=' As Character UNION ALL
SELECT '7F' As HexByte, '"' As Character UNION ALL
SELECT '80' As HexByte, '˘' As Character UNION ALL
SELECT '81' As HexByte, 'a' As Character UNION ALL
SELECT '82' As HexByte, 'b' As Character UNION ALL
SELECT '83' As HexByte, 'c' As Character UNION ALL
SELECT '84' As HexByte, 'd' As Character UNION ALL
SELECT '85' As HexByte, 'e' As Character UNION ALL
SELECT '86' As HexByte, 'f' As Character UNION ALL
SELECT '87' As HexByte, 'g' As Character UNION ALL
SELECT '88' As HexByte, 'h' As Character UNION ALL
SELECT '89' As HexByte, 'i' As Character UNION ALL
SELECT '8A' As HexByte, 'ś' As Character UNION ALL
SELECT '8B' As HexByte, 'ň' As Character UNION ALL
SELECT '8C' As HexByte, 'đ' As Character UNION ALL
SELECT '8D' As HexByte, 'ý' As Character UNION ALL
SELECT '8E' As HexByte, 'ř' As Character UNION ALL
SELECT '8F' As HexByte, 'ş' As Character UNION ALL
SELECT '90' As HexByte, '°' As Character UNION ALL
SELECT '91' As HexByte, 'j' As Character UNION ALL
SELECT '92' As HexByte, 'k' As Character UNION ALL
SELECT '93' As HexByte, 'l' As Character UNION ALL
SELECT '94' As HexByte, 'm' As Character UNION ALL
SELECT '95' As HexByte, 'n' As Character UNION ALL
SELECT '96' As HexByte, 'o' As Character UNION ALL
SELECT '97' As HexByte, 'p' As Character UNION ALL
SELECT '98' As HexByte, 'q' As Character UNION ALL
SELECT '99' As HexByte, 'r' As Character UNION ALL
SELECT '9A' As HexByte, 'ł' As Character UNION ALL
SELECT '9B' As HexByte, 'ń' As Character UNION ALL
SELECT '9C' As HexByte, 'š' As Character UNION ALL
SELECT '9D' As HexByte, '¸' As Character UNION ALL
SELECT '9E' As HexByte, '˛' As Character UNION ALL
SELECT '9F' As HexByte, '¤' As Character UNION ALL
SELECT 'A0' As HexByte, 'ą' As Character UNION ALL
SELECT 'A1' As HexByte, '~' As Character UNION ALL
SELECT 'A2' As HexByte, 's' As Character UNION ALL
SELECT 'A3' As HexByte, 't' As Character UNION ALL
SELECT 'A4' As HexByte, 'u' As Character UNION ALL
SELECT 'A5' As HexByte, 'v' As Character UNION ALL
SELECT 'A6' As HexByte, 'w' As Character UNION ALL
SELECT 'A7' As HexByte, 'x' As Character UNION ALL
SELECT 'A8' As HexByte, 'y' As Character UNION ALL
SELECT 'A9' As HexByte, 'z' As Character UNION ALL
SELECT 'AA' As HexByte, 'Ś' As Character UNION ALL
SELECT 'AB' As HexByte, 'Ň' As Character UNION ALL
SELECT 'AC' As HexByte, 'Đ' As Character UNION ALL
SELECT 'AD' As HexByte, 'Ý' As Character UNION ALL
SELECT 'AE' As HexByte, 'Ř' As Character UNION ALL
SELECT 'AF' As HexByte, 'Ş' As Character UNION ALL
SELECT 'B0' As HexByte, '˙' As Character UNION ALL
SELECT 'B1' As HexByte, 'Ą' As Character UNION ALL
SELECT 'B2' As HexByte, 'ż' As Character UNION ALL
SELECT 'B3' As HexByte, 'Ţ' As Character UNION ALL
SELECT 'B4' As HexByte, 'Ż' As Character UNION ALL
SELECT 'B5' As HexByte, '§' As Character UNION ALL
SELECT 'B6' As HexByte, 'ž' As Character UNION ALL
SELECT 'B7' As HexByte, 'ź' As Character UNION ALL
SELECT 'B8' As HexByte, 'Ž' As Character UNION ALL
SELECT 'B9' As HexByte, 'Ź' As Character UNION ALL
SELECT 'BA' As HexByte, 'Ł' As Character UNION ALL
SELECT 'BB' As HexByte, 'Ń' As Character UNION ALL
SELECT 'BC' As HexByte, 'Š' As Character UNION ALL
SELECT 'BD' As HexByte, '¨' As Character UNION ALL
SELECT 'BE' As HexByte, '´' As Character UNION ALL
SELECT 'BF' As HexByte, '×' As Character UNION ALL
SELECT 'C0' As HexByte, '{' As Character UNION ALL
SELECT 'C1' As HexByte, 'A' As Character UNION ALL
SELECT 'C2' As HexByte, 'B' As Character UNION ALL
SELECT 'C3' As HexByte, 'C' As Character UNION ALL
SELECT 'C4' As HexByte, 'D' As Character UNION ALL
SELECT 'C5' As HexByte, 'E' As Character UNION ALL
SELECT 'C6' As HexByte, 'F' As Character UNION ALL
SELECT 'C7' As HexByte, 'G' As Character UNION ALL
SELECT 'C8' As HexByte, 'H' As Character UNION ALL
SELECT 'C9' As HexByte, 'I' As Character UNION ALL
SELECT 'CA' As HexByte, '­' As Character UNION ALL
SELECT 'CB' As HexByte, 'ô' As Character UNION ALL
SELECT 'CC' As HexByte, 'ö' As Character UNION ALL
SELECT 'CD' As HexByte, 'ŕ' As Character UNION ALL
SELECT 'CE' As HexByte, 'ó' As Character UNION ALL
SELECT 'CF' As HexByte, 'ő' As Character UNION ALL
SELECT 'D0' As HexByte, '}' As Character UNION ALL
SELECT 'D1' As HexByte, 'J' As Character UNION ALL
SELECT 'D2' As HexByte, 'K' As Character UNION ALL
SELECT 'D3' As HexByte, 'L' As Character UNION ALL
SELECT 'D4' As HexByte, 'M' As Character UNION ALL
SELECT 'D5' As HexByte, 'N' As Character UNION ALL
SELECT 'D6' As HexByte, 'O' As Character UNION ALL
SELECT 'D7' As HexByte, 'P' As Character UNION ALL
SELECT 'D8' As HexByte, 'Q' As Character UNION ALL
SELECT 'D9' As HexByte, 'R' As Character UNION ALL
SELECT 'DA' As HexByte, 'Ě' As Character UNION ALL
SELECT 'DB' As HexByte, 'ű' As Character UNION ALL
SELECT 'DC' As HexByte, 'ü' As Character UNION ALL
SELECT 'DD' As HexByte, 'ť' As Character UNION ALL
SELECT 'DE' As HexByte, 'ú' As Character UNION ALL
SELECT 'DF' As HexByte, 'ě' As Character UNION ALL
SELECT 'E0' As HexByte, '' As Character UNION ALL
SELECT 'E1' As HexByte, '÷' As Character UNION ALL
SELECT 'E2' As HexByte, 'S' As Character UNION ALL
SELECT 'E3' As HexByte, 'T' As Character UNION ALL
SELECT 'E4' As HexByte, 'U' As Character UNION ALL
SELECT 'E5' As HexByte, 'V' As Character UNION ALL
SELECT 'E6' As HexByte, 'W' As Character UNION ALL
SELECT 'E7' As HexByte, 'X' As Character UNION ALL
SELECT 'E8' As HexByte, 'Y' As Character UNION ALL
SELECT 'E9' As HexByte, 'Z' As Character UNION ALL
SELECT 'EA' As HexByte, 'ď' As Character UNION ALL
SELECT 'EB' As HexByte, 'Ô' As Character UNION ALL
SELECT 'EC' As HexByte, 'Ö' As Character UNION ALL
SELECT 'ED' As HexByte, 'Ŕ' As Character UNION ALL
SELECT 'EE' As HexByte, 'Ó' As Character UNION ALL
SELECT 'EF' As HexByte, 'Ő' As Character UNION ALL
SELECT 'F0' As HexByte, '0' As Character UNION ALL
SELECT 'F1' As HexByte, '1' As Character UNION ALL
SELECT 'F2' As HexByte, '2' As Character UNION ALL
SELECT 'F3' As HexByte, '3' As Character UNION ALL
SELECT 'F4' As HexByte, '4' As Character UNION ALL
SELECT 'F5' As HexByte, '5' As Character UNION ALL
SELECT 'F6' As HexByte, '6' As Character UNION ALL
SELECT 'F7' As HexByte, '7' As Character UNION ALL
SELECT 'F8' As HexByte, '8' As Character UNION ALL
SELECT 'F9' As HexByte, '9' As Character UNION ALL
SELECT 'FA' As HexByte, 'Ď' As Character UNION ALL
SELECT 'FB' As HexByte, 'Ű' As Character UNION ALL
SELECT 'FC' As HexByte, 'Ü' As Character UNION ALL
SELECT 'FD' As HexByte, 'Ť' As Character UNION ALL
SELECT 'FE' As HexByte, 'Ú' As Character UNION ALL
SELECT 'FF' As HexByte, '?' As Character

The Script was generated by simple C# method.

public static void CreateEBCDICScript(string destinationFileName, int destinationCodePage)
{
    StringBuilder sb = new StringBuilder();
    byte[] buffer = new byte[1];
    Encoding ebcdic = Encoding.GetEncoding(870);
    Encoding destinationEncoding = Encoding.GetEncoding(destinationCodePage);    

    for (int i = 64; i < 256; i++)
    {
        buffer[0] = (byte)i;
        buffer = Encoding.Convert(ebcdic, destinationEncoding, buffer);
        string target = destinationEncoding.GetString(buffer);
        sb.AppendFormat("SELECT '{0:X}' AS HexByte, '{1}' AS Character UNION ALL", i, target.Equals("'") ? "''" : target);
        sb.AppendLine();
    }    

    File.WriteAllText(destinationFileName, sb.ToString();
}
You can use it to generate script for whatever encoding you want.

Finally as you have the conversion functions in database you can use below selects to receive correctly encoded strings from within the AS400 machine.

--CLR Version 
SELECT 
    fnECDIC870ToSql(FieldName, 1250) As FieldName 
FROM OPENQUERY(AS400, 'SELECT HEX(FieldName) AS FieldName FROM AS400Table')

--T-SQL Version 
SELECT 
    fnECDIC870ToWin1250(FieldName) As FieldName 
FROM OPENQUERY(AS400, 'SELECT HEX(FieldName) AS FieldName FROM AS400Table')