Exporting CLR Assemblies from SQL Server back to .dll files

2014-01-08 Pavel Pawlowski CLR, SQL Server assembly, CLR, export, SQL Server

On mssqltips.com you can find my post Exporting CLR Assemblies from SQL Server back to .dll files.

This post describes how you can export the assemblies from SQL server back to physical .dll files. This can be very useful in cases when you do not have the source code for he assemblies inside the DB and you would like to do a security review of the assembly. This commonly happens if you receive the assembly as BYTE code in T-SQL.

Removing Accent (diacritics) using CLR

2012-01-272016-10-08 Pavel Pawlowski CLR, SQL Server accent, CLR, SQL Server, string

Some times it may happen, that you need to remove accent (diacritics) from string when you are querying data or when you are storing data into particular table.

There are several methods how to achieve this. One of the possible solution is using a CONVERT with COLLATE as Twitted by MVP Mladen Prajdic and mentioned on Luke Jian blog, but this method has an issue I will mention below.

Other solution could be replacement of accented characters with the ones without accent. But this is a very problematic solution as you have to do a lot of replacement (the performance can suffer) and it will be problematic to cover all the possible accent characters and you easily miss one. Similar to this could be creation of a replacement table and implementation of function similar to Splitting function using Tally Table (Jeff Moden). But again you will have to write all possible combination of accented and not accented characters into the replacement table.

So if you do not want to change the collation and would like to avoid possible issues of the CONVERT method and avoid writing replacement tables etc., you can easily use a very simple amd known CLR method. This method is using string normalization and CharUnicodeInfo class from the System.Globalization name space. Based on this you can write a very simple scalar CLR function.

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.Globalization;

public class StringsCommon
{
    /// <summary>
    /// Removes Accent (Diacritics) from string
    /// </summary>
    /// <param name="sourceString">Source string fro wchich accent should be removed</param>
    /// <returns>string without accent</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlChars RemoveAccent(SqlString sourceString)
    {
        if (sourceString.IsNull)
            return SqlChars.Null;

        string normalized = sourceString.Value.Normalize(NormalizationForm.FormD);

        StringBuilder output = new StringBuilder(sourceString.Value.Length);

        foreach (char ch in normalized)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                output.Append(ch);
        }
        return new SqlChars(output.ToString());
    }
}

Once you compile above mentioned function into an .net assembly, you can register the assembly and function in your database.

CREATE ASSEMBLY [PPSqlClrSafe]
AUTHORIZATION [dbo]
FROM 'C:CLRPPSqlClrSafe.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE FUNCTION [ClrSafe].[fn_RemoveAccent](
	@sourceString [nvarchar](max)  --Source string to remove accent
)
RETURNS [nvarchar](4000)
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [PPSqlClrSafe].[StringsCommon].[RemoveAccent]
GO</pre>
</div>
Once the function is registered, you can easily use it to remove accent (diacritics) from whatever string you want
<div style="max-height:400px;overflow:auto;">
<pre class="brush:sql">WITH Data AS (
    SELECT
        'Czech' AS [Language]
        ,N'á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů' AS [AccentChars] UNION ALL
    SELECT
        'Norwegian' AS [Language]
        ,N'ø' AS [AccentChars] UNION ALL
    SELECT
        'Estonian' AS [Language]
        ,N'õ' AS [AccentChars] UNION ALL
    SELECT
        'French' AS [Language]
        ,N'à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü' AS [AccentChars] UNION ALL
    SELECT
        'Croatian, Vietnamese' AS [Language]
        ,N'đ' AS [AccentChars] UNION ALL
    SELECT
        'Latvian' AS [Language]
        ,N'ā, ē, ī, ū, ļ, ķ, ņ, ģ' AS [AccentChars] UNION ALL
    SELECT
        'Hungarian' AS [Language]
        ,N'ö, ü, ő, ű' AS [AccentChars] UNION ALL
    SELECT
        'Polish' AS [Language]
        ,N'ą, ć, ę, ń, ó, ś, ź, ż' AS [AccentChars] UNION ALL
    SELECT
        'Romanian' AS [Language]
        ,N'ă, â, î, ș, ț' AS [AccentChars] UNION ALL
    SELECT
        'Slovak' AS [Language]
        ,N'ô, ť, ľ, ŕ, ĺ, ä' AS [AccentChars] UNION ALL
    SELECT
        'Spanish' AS [Language]
        ,N'ñ' AS [AccentChars] UNION ALL
    SELECT
        'Swedish' AS [Language]
        ,N'å' AS [AccentChars] UNION ALL
    SELECT
        'Turkish' AS [Language]
        ,N'ç, ş, ğ' AS [AccentChars] UNION ALL
    SELECT
        'Greece' AS [Language]
        ,N'Γ, δ, ξ, Φ' AS [AccentChars]
)
SELECT
    [Language]
    ,[AccentChars]
    ,[ClrSafe].fn_RemoveAccent([AccentChars]) AS [RemovedAccent]
FROM Data

Which produces below result:

Language             AccentChars                                 RemovedAccent
-------------------- ------------------------------------------- --------------------------------------------
Czech                á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů a, e, i, o, u, y, c, d, e, n, r, s, t, z, u
Norwegian            ø                                           ø
Estonian             õ                                           o
French               à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü       a, a, c, e, e, e, e, i, i, o, u, u, u
Croatian, Vietnamese đ                                           đ
Latvian              ā, ē, ī, ū, ļ, ķ, ņ, ģ                      a, e, i, u, l, k, n, g
Hungarian            ö, ü, ő, ű                                  o, u, o, u
Polish               ą, ć, ę, ń, ó, ś, ź, ż                      a, c, e, n, o, s, z, z
Romanian             ă, â, î, ș, ț                               a, a, i, s, t
Slovak               ô, ť, ľ, ŕ, ĺ, ä                            o, t, l, r, l, a
Spanish              ñ                                           n
Swedish              å                                           a
Turkish              ç, ş, ğ                                     c, s, g
Greece               Γ, δ, ξ, Φ                                  Γ, δ, ξ, Φ

When you compare it to the CONVERT with COLLATION method, you can find that this doesn’t have the problem with non existen characters. As the CONVERT converts the characters to particular character set which does not need to contain all the charecters of the source character set. We can see this e.g. on the Norwegian, and Greece characters.

If we use the CONVERT with COLLATION method we receive:

WITH Data AS (
    SELECT
        'Czech' AS [Language]
        ,N'á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů' AS [AccentChars] UNION ALL
    SELECT
        'Norwegian' AS [Language]
        ,N'ø' AS [AccentChars] UNION ALL
    SELECT
        'Estonian' AS [Language]
        ,N'õ' AS [AccentChars] UNION ALL
    SELECT
        'French' AS [Language]
        ,N'à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü' AS [AccentChars] UNION ALL
    SELECT
        'Croatian, Vietnamese' AS [Language]
        ,N'đ' AS [AccentChars] UNION ALL
    SELECT
        'Latvian' AS [Language]
        ,N'ā, ē, ī, ū, ļ, ķ, ņ, ģ' AS [AccentChars] UNION ALL
    SELECT
        'Hungarian' AS [Language]
        ,N'ö, ü, ő, ű' AS [AccentChars] UNION ALL
    SELECT
        'Polish' AS [Language]
        ,N'ą, ć, ę, ń, ó, ś, ź, ż' AS [AccentChars] UNION ALL
    SELECT
        'Romanian' AS [Language]
        ,N'ă, â, î, ș, ț' AS [AccentChars] UNION ALL
    SELECT
        'Slovak' AS [Language]
        ,N'ô, ť, ľ, ŕ, ĺ, ä' AS [AccentChars] UNION ALL
    SELECT
        'Spanish' AS [Language]
        ,N'ñ' AS [AccentChars] UNION ALL
    SELECT
        'Swedish' AS [Language]
        ,N'å' AS [AccentChars] UNION ALL
    SELECT
        'Turkish' AS [Language]
        ,N'ç, ş, ğ' AS [AccentChars] UNION ALL
    SELECT
        'Greece' AS [Language]
        ,N'Γ, δ, ξ, Φ' AS [AccentChars]
)
SELECT
    [Language]
    ,[AccentChars]
    ,CONVERT(varchar(50), [AccentChars]) COLLATE Cyrillic_General_CI_AI AS [RemovedAccent]
FROM Data

Results:

Language             AccentChars                                 RemovedAccent
-------------------- ------------------------------------------- --------------------------------------------
Czech                á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů a, e, i, o, u, y, c, d, e, n, r, s, t, z, u
Norwegian            ø                                           o
Estonian             õ                                           o
French               à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü       a, a, c, e, e, e, e, i, i, o, u, u, u
Croatian, Vietnamese đ                                           d
Latvian              ā, ē, ī, ū, ļ, ķ, ņ, ģ                      a, e, i, u, l, k, n, g
Hungarian            ö, ü, ő, ű                                  o, u, o, u
Polish               ą, ć, ę, ń, ó, ś, ź, ż                      a, c, e, n, o, s, z, z
Romanian             ă, â, î, ș, ț                               a, a, i, ?, ?
Slovak               ô, ť, ľ, ŕ, ĺ, ä                            o, t, l, r, l, a
Spanish              ñ                                           n
Swedish              å                                           a
Turkish              ç, ş, ğ                                     c, s, g
Greece               Γ, δ, ξ, Φ                                  ?, ?, ?, ?

From above we can see, that the CLR solution provides much better results and you do not loose any characters by conversion to non Unicode character set

If you are interested, you can grab sample C# project here: PPSqlClrSafe_AccentRemoval,zip

Any comments are welcomed.

Second Update to Querying Active Directory on SQL Server using CLR

2011-11-30 Pavel Pawlowski CLR, SQL Server Active Directory, CLR, Query, SQL Server

Finally I had a time to make a second update to my post Querying Active Directory on SQL Server using CLR. This update should finally resolve an issue when you need to return large numbers of properties from Active Directory.

The solution is very simple by adding a possibility to specify the columns lengths (maximum length) for the returned properties. All the structures in the code left the same as they were originally. The only change is, that the list of properties is no longer passed as a comma separated list and now semicolon is used to delimit the properties. Comma can be used to provide a column length. If the length is not provided then the default of 4000 is used as it was before.

So don’t hesitate and check the updated article.

I’ve also provided a link to an updated Visual C# project with all the codes.

SQL Server Regular Expressions – Replacements

2011-05-25 Pavel Pawlowski CLR, SQL Server CLR, RegEx, SQL Server

In my earlier post SQL Server 2005 and SQL Server 2008 Regular Expressions and pattern matching I was focusing on the Regular Expressions matching on the SQL Server using a CLR.

This post was focused only on the matches itself, but didn’t cover another great feature of the Regular Expressions and this feature is Replacements.

When using the replacements then first the regular expression is executed on the source string and then on one, several or all matches a replacement is applied and the result of such replacement is returned.

Here are the CLR functions which handle the replacements.

using System;
using System.Collections.Generic;
using System.Text;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Collections;

/// <summary>
/// Regular expressions for SQL
/// </summary>
public class SQLRegEx
{
    #region Internal structures
    private struct RegExRowReplace
    {
        /// <summary>
        /// Replacement table row
        /// </summary>
        /// <param name="matchID">ID of the match. ID = 0 represents a complete source string</param>
        /// <param name="match">
        /// String representing match. For ID=0 is NULL as it is equal to original string. Null is returnet to limit transferring the whore source string
        /// </param>
        /// <param name="result">Result of the replacement for the matchID</param>
        public RegExRowReplace(int matchID, SqlChars match, SqlChars result)
        {
            MatchID = matchID;
            Match = match;
            Result = result;
        }

        public int MatchID;
        public SqlChars Match;
        public SqlChars Result;
    }

    #endregion

    #region RegEx functions

    /// <summary>
    /// Applies Regular Expression on the Source string, takes apropriate match and aplies e replace on it.
    /// </summary>
    /// <param name="sourceString">Source string on which the regular expression should be applied</param>
    /// <param name="pattern">Regular Expression pattern</param>
    /// <param name="replacement">Replacement to be used on match</param>
    /// <param name="matchId">ID of the Match to be returned. 0 replaces all matches in the sourceString</param>
    /// <returns>Value of the Group from within a Match</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlChars RegExReplace(SqlString sourceString, SqlString pattern, SqlString replacement, int matchId)
    {
        Match m = null;
        Regex r = new Regex(pattern.Value, RegexOptions.Compiled);

        if (matchId == 0)
        {
            return new SqlChars(r.Replace(sourceString.Value, replacement.Value));
        }
        if (matchId == 1)
        {
            m = r.Match(sourceString.Value);
        }
        else if (matchId > 1)
        {
            MatchCollection mc = r.Matches(sourceString.Value);
            m = mc != null && mc.Count > matchId - 1 ? mc[matchId - 1] : null;
        }

        return m != null ? new SqlChars(m.Result(replacement.Value)) : SqlChars.Null;
    }


    /// <summary>
    /// Applies Regular Expression o the Source strings and return all matches and final results after replacement
    /// </summary>
    /// <param name="sourceString">Source string on which the regular expression should be applied</param>
    /// <param name="pattern">Regular Expression pattern</param>
    /// <param name="replacement">Replacement to be used on matches</param>
    /// <returns>IEnumerable</returns>
    [SqlFunction(FillRowMethodName = "FillRegExRowReplace")]
    public static IEnumerable RegExMatchesReplace(SqlString sourceString, SqlString pattern, SqlString replacement)
    {
        Regex r = new Regex(pattern.Value, RegexOptions.Compiled);
        int matchId = 0;

        foreach (Match m in r.Matches(sourceString.Value))
        {
            yield return new RegExRowReplace(++matchId, new SqlChars(m.Value), new SqlChars(m.Result(replacement.Value)));
        }
    }

    #endregion

    #region FillRow methods

    public static void FillRegExRowReplace(object obj, out int matchID, out SqlChars match, out SqlChars result)
    {
        RegExRowReplace row = (RegExRowReplace)obj;
        matchID = row.MatchID;
        match = row.Match;
        result = row.Result;
    }

    #endregion
}

RegExReplace

This is a scalar CLR function which handles a Regular Expression replacement. It allows to replace a single match from within a string or even replace all matches inside the string if matchID = 0 is passed as an argument.

RegExMatchesReplace

In contrast to RegExReplace the RegExMatchesReplace is a CLR table valued function, which returns all the matches as separated rows and also replacements. It returns a table with three columns MatchID which represents ID of the match from within the original string, Match which represents the particular RegEx match from within the string and Result, which is result of the replacement done on the Match.

So once the class is compiled into Assembly, we can register the functions inside SQL Server database.

CREATE ASSEMBLY [SQLRegEx]
AUTHORIZATION [dbo]
FROM 'C:CLRSQLRegEx.dll'
WITH PERMISSION_SET = SAFE

CREATE FUNCTION [dbo].[fn_RegExReplace](
    @sourceString [nvarchar](max),        --Source string to apply RegEx
    @pattern [nvarchar](4000),            --RegEx pattern to apply on the source string
    @replacement [nvarchar](400),        --Replacement pattern to apply on match
    @matchID [int] = 1                    --ID of the match to return. @matchID = 1 is the first match. IF @matchID=0 then all the patterns in the @sourceString are replaced
)        
RETURNS [nvarchar](4000)
WITH EXECUTE AS CALLER
AS 
EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExReplace]
GO


CREATE FUNCTION [dbo].[fn_RegExMatchesReplace](
    @sourceString [nvarchar](max),  --Source string to apply RegEx
    @pattern [nvarchar](4000),        --RegEx to apply on the source string
    @replacement [nvarchar](4000)    --Replacement pattern to apply on matches
)        
RETURNS  TABLE (
    [matchId] [int] NULL,                --ID of returnet match
    [match] [nvarchar](4000) NULL,        --value of match
    [result] [nvarchar](4000) NULL        --Replacement result
) WITH EXECUTE AS CALLER
AS 
EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatchesReplace]
GO

Testing the Replace method

Here are some sample queries so you can imagine how the functions work.

Replace year in all Dates with year 2011

WITH StringDates AS (
    SELECT '2010/09/3' AS StringDate UNION ALL
    SELECT '2010/07/09' UNION ALL
    SELECT '2009/1/15'
)
SELECT
    StringDate,
    dbo.fn_RegExReplace(StringDate, '(d{4})/(d{1,2})/(d{1,2})', '2011/$2/$3', 0)
FROM StringDates

Remove the values from the Key=Value pairs and leave only the Key values separated by semicolons

SELECT dbo.fn_RegExReplace(     'Key1=Val1;Key2=Val2;Key3=Val3;Key4=Val4',     's*?(w+?)s*?=s*?(.*?)(;|$)',     '$1$3',     0)

Get list of all dates from within a string and replace the year of each date by 2011

SELECT
*
FROM dbo.fn_RegExMatchesReplace(
    '2010/09/3;2010/07/09;2009/1/15', 
    '(d{4})/(d{1,2})/(d{1,2})', 
    '2011/$2/$3'
    )

From the examples we can see, that CLR give us a real power to using Regular Expressions on SQL Server. It allows us to make some complex transformation which could not be possible to do using a pure T-SQL and in case of transformation which can be done by pure T-SQL, mostly the CLR RegEx solution is much faster and straightforward.

Update to Querying Active Directory on SQL Server using CLR

2011-04-21 Pavel Pawlowski CLR, SQL Server Active Directory, CLR, Query, SQL Server

I’ve just posted an update to my article Querying Active Directory on SQL Server using CLR. I’ve added another 2 parameters to the samples which one of them allows control the Page Size limit for paged AD Search and the second one allows limit the number of rows returned to the client. This can be handful when the AD query can return more results than we need. In such case it’s easy to limit the number of rows.

I’ve updated the article after the comment from Chris, when he received an insufficient memory error when he wanted to return a higher number of fields from AD.

In this case the lowering the page size from 1000 to lower values should avoid this kind of problems.

Querying Active Directory on SQL Server using CLR

2011-04-13 Pavel Pawlowski CLR, SQL Server Active Directory, CLR, Query, SQL Server

In my previous article Querying Active Directory on SQL Server using T-SQL I was showing the possibilities of querying the Active Directory using T-SQL, specifically using linked server and OPENQUERY statement or without linked server using the OPENROWSET statement.

This is an updated article and includes modification coming upon some of the comments to resolve some issues with large number of returned AD properties.

There ware mentioned some limitations of querying the AD using T-SQL and CLR will help us to bypass those limitations. Especially the limit of 1000 records returned as we can use pagination in the CLR code effectively.

For the purpose I’m creating a CLR Stored Procedure which will take several argument and return a result set. The reason I’m going to use a CLR Stored Procedure is, that stored procedure will allow me to return dynamic result set. It means I can specify properties of AD to return and those will be returned as columns of the result set. In case we go through the CLR Table Valued function, we had to create a separate function for each properties combination we would like to return.

So here is the CLR class for querying AD.

public class ActiveDirectory
{
    /// <summary>
    /// Queries Active directory according provided parameters
    /// Current user credentials are used for authentication
    /// </summary>
    /// <param name="adRoot">AD Root for querying AD</param>
    /// <param name="filter">Filter to be used for querying</param>
    /// <param name="searchScope">Scope to be used for queryingg</param>
    /// <param name="propertiesToLoad">List of properties to return</param>
    /// <param name="pageSize">Represents a PageSise for the paged search of AD</param>
    /// <param name="rowsLimit">Rrepresent limit for numbers of rows returned. NULL or value less than 1 represents unlimited</param>
    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void QueryAD(string adRoot, string filter, string propertiesToLoad, string searchScope, int pageSize, SqlInt32 rowsLimit)
    {
        SearchAD(null, null, null, adRoot, filter, searchScope, propertiesToLoad, pageSize, rowsLimit);
    }

    /// <summary>
    /// Queries Active directory according provided parameters
    /// </summary>
    /// <param name="userName">UserName to be used to authenticate AD</param>
    /// <param name="password">Password to be used to authenticate to AD</param>
    /// <param name="adRoot">AD Root for querying AD</param>
    /// <param name="filter">Filter to be used for querying</param>
    /// <param name="searchScope">Scope to be used for queryingg</param>
    /// <param name="propertiesToLoad">List of properties to return</param>
    /// <param name="pageSize">Represents a PageSise for the paged search of AD</param>
    /// <param name="rowsLimit">Rrepresent limit for numbers of rows returned. NULL or value less than 1 represents unlimited</param>
    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void QueryADUName(string userName, string password, string adRoot, string filter, string propertiesToLoad, string searchScope, int pageSize, SqlInt32 rowsLimit)
    {
        SearchAD(userName, password, null, adRoot, filter, searchScope, propertiesToLoad, pageSize, rowsLimit);
    }

    /// <summary>
    /// Queries Active directory according provided parameters
    /// </summary>
    /// <param name="userName">UserName to be used to authenticate AD</param>
    /// <param name="password">Password to be used to authenticate to AD</param>
    /// <param name="authType">Authentication type to be used to authenticate to AD</param>
    /// <param name="adRoot">AD Root for querying AD</param>
    /// <param name="filter">Filter to be used for querying</param>
    /// <param name="searchScope">Scope to be used for queryingg</param>
    /// <param name="propertiesToLoad">List of properties to return</param>
    /// <param name="pageSize">Represents a PageSise for the paged search of AD</param>
    /// <param name="rowsLimit">Rrepresent limit for numbers of rows returned. NULL or value less than 1 represents unlimited</param>
    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void QueryADAuth(string userName, string password, string authType, string adRoot, string filter, string propertiesToLoad, string searchScope, int pageSize, SqlInt32 rowsLimit)
    {
        SearchAD(userName, password, authType, adRoot, filter, searchScope, propertiesToLoad, pageSize, rowsLimit);
    }

    private static DirectoryEntry GetRootEntry(string adRoot, string userName, string password, string authType)
    {
        if (userName == null)
            return new DirectoryEntry(adRoot);
        else if (authType == null)
            return new DirectoryEntry(adRoot, userName, password);
        else
        {
            AuthenticationTypes at;
            if (TryParseEnum<AuthenticationTypes>(authType, true, out at))
                return new DirectoryEntry(adRoot, userName, password, at);
            else
                throw new System.InvalidCastException(string.Format("authType must be one of '{0}'", GetEnumNames<AuthenticationTypes>()));
        }
    }

    private static string GetEnumNames<T>()
    {
        string[] names = Enum.GetNames(typeof(T));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < names.Length; i++)
        {
            if (i > 0)
                sb.Append(',');
            sb.Append(names[i]);
        }
        return sb.ToString();
    }

    private static bool TryParseEnum<T>(string value, bool ignoreCase, out T outEnum)
    {
        try
        {
            outEnum = (T)Enum.Parse(typeof(T), value, ignoreCase);
            return true;
        }
        catch
        {
            outEnum = (T)Enum.GetValues(typeof(T)).GetValue(0);
            return false;
        }
    }

    /// <summary>
    /// Searches Active Directory according provided parameters
    /// </summary>
    /// <param name="userName">UserName to be used to authenticate AD</param>
    /// <param name="password">Password to be used to authenticate to AD</param>
    /// <param name="authType">Authentication type to be used to authenticate to AD</param>
    /// <param name="adRoot">AD Root for querying AD</param>
    /// <param name="filter">Filter to be used for querying</param>
    /// <param name="searchScope">Scope to be used for queryingg</param>
    /// <param name="propertiesToLoad">List of properties to return</param>
    /// <param name="pageSize">Represents a PageSise for the paged search of AD</param>
    /// <param name="rowsLimit">Rrepresent limit for numbers of rows returned. NULL or value less than 1 represents unlimited</param>
    private static void SearchAD(string userName, string password, string authType, string adRoot, string filter, string searchScope, string propertiesToLoad, int pageSize, SqlInt32 rowsLimit)
    {
        string[] properties = propertiesToLoad.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
        SqlMetaData[] recordMetaData = new SqlMetaData[properties.Length];
        SearchScope scope;

        int limit = rowsLimit.IsNull ? 0 : rowsLimit.Value;
        int rowsCount = 0;

        if (rowsLimit > 0 && pageSize > limit)
            pageSize = limit;

        if (!TryParseEnum<SearchScope>(searchScope, true, out scope))
            throw new System.InvalidCastException(string.Format("searchScope must be one of '{0}'", GetEnumNames<SearchScope>()));

        //Trim properties and prepare result set metadata, also process specified lengths
        for (int i = 0; i < properties.Length; i++)
        {
            string[] propDetails = properties[i].Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
            string propName = propDetails[0].Trim();
            int len;

            if (propDetails.Length < 2 || !int.TryParse(propDetails[1], out len) || len < 1 || len > 4000)
                len = 4000;

            recordMetaData[i] = new SqlMetaData(properties[i], System.Data.SqlDbType.NVarChar, len);
            properties[i] = propName;
        }

        //Get Root Directory Entry
        using (DirectoryEntry rootEntry = GetRootEntry(adRoot, userName, password, authType))
        {
            //Create a directory searcher with aproperiate filter, properties and search scope
            using (DirectorySearcher ds = new DirectorySearcher(rootEntry, filter, properties, scope))
            {
                ds.PageSize = pageSize; //Set Page Size - without this we will not do a paged search and we will be limiited to 1000 results

                //find all object from the rood, according the filter and search scope
                using (SearchResultCollection results = ds.FindAll())
                {
                    SqlDataRecord record = new SqlDataRecord(recordMetaData);
                    //Start pushing of records to client
                    SqlContext.Pipe.SendResultsStart(record);

                    foreach (SearchResult result in results)
                    {
                        record = new SqlDataRecord(recordMetaData);

                        for (int i = 0; i < properties.Length; i++)
                        {
                            ResultPropertyValueCollection props = result.Properties[properties[i]];

                            if (props.Count == 1)           //if property collection contains single vallue, set the record field to that value
                                record.SetSqlString(i, props[0].ToString());
                            else if (props.Count == 0)      //if property collection doesn't contain any value, set record field to NULL
                                record.SetSqlString(i, SqlString.Null);
                            else                            //In case of multiple value, separate the values by commas
                            {
                                StringBuilder sb = new StringBuilder();
                                bool firstItem = true;
                                foreach (object prop in props)
                                {
                                    if (!firstItem)
                                        sb.Append(',');
                                    else
                                        firstItem = false;

                                    sb.Append(prop.ToString());
                                }

                                record.SetSqlString(i, sb.ToString());
                            }
                        }

                        //send record to client
                        SqlContext.Pipe.SendResultsRow(record);

                        //if rowsLimit was reached, break the loop
                        if (++rowsCount == rowsLimit)
                            break;
                    }

                    //stop sending records to client
                    SqlContext.Pipe.SendResultsEnd();
                }
            }
        }
    }
}

The class contains three public methods (CLR Stored Procedures) QueryAD, QueryADUName and QueryADAuth. The first one will query AD using current user credentials and default authentication method, second will query AD using provided user credentials and default authentication method and in the third one we can specify also authentication method.

Those methods call a private method SearchAD, which takes care about the AD Searching and return the returns the result to client.

After the comment from Chris, I’ve updated my sample codes to have also the optional parameter pageSize which allow reduce the size of a Page for Paged Search used for querying the AD and avoid insufficient memory problems when querying higher amount of AD attributes.

I’ve also added a parameter rowsLimit which allows limit the maximum number of rows returned, which can be useful especially when querying very large AD. There is added a break into the loop which iterates the results from AD when the number of precessed imtes reach the limit.

UPDATE:

Finally I’ve also updated the private static void SearchAD method so now you pass the properties not as a comma separated list but a semicolon separated list. You can use a comma to specify the return length of each property. If the length is not specified, then the method will use a default 4000 characters length. This update should finally solve the issues with large number of properties returned when previously all were returned as nvarchar(4000)

The updated part of the code is the for loop which is processing properties and generating result set metadata, starting with the comment //Trim properties and prepare result set metadata, also process specified lengths

END OF UPDATE:

To be able to compile the code for example using Visual C# Express it is necessary to add reference to the System.DirectoryServices assembly in the project.

Once we compile the code and create say ADServices.dll assembly, we can register that assembly and CLR Stored procedures in our database.

Because the ActiveDirectory class is using System.DirectoryServices assembly, we will have to use UNSAFE PERMISSION_SET for our assembly and so the database using that assembly has to be TRUSTWORTHY.

ALTER DATABASE TestDB3 SET TRUSTWORTHY ON WITH ROLLBACK IMMEDIATE;

As we are using the System.DirectoryServices assembly, we have to register it in our database prior registering our assembly, otherwise we will not be able to register it.

--Register the NetFramework System.DirectoryServices assembly
CREATE ASSEMBLY [System.DirectoryServices]
AUTHORIZATION [dbo]
FROM 'C:WindowsMicrosoft.NETFrameworkv2.0.50727System.DirectoryServices.dll'
WITH PERMISSION_SET = UNSAFE
GO
--Register our assembly with ActiveDirectory CLR Stored Procedures
CREATE ASSEMBLY [ADServices]
AUTHORIZATION [dbo]
FROM 'C:CLRADServices.dll'
WITH PERMISSION_SET = UNSAFE
GO

Once we have successfully registered those assemblies, we can create the CLR Stored procedures in our DB.

--==========================================================================
-- @adRoot - root of searching eg. 'LDAP://OU=Sales,DC=Fabricam,DC=com'
-- @filter -  filter to be used for searching eg. '(&(objectCategory=group)'
-- @propertiesToLoad - list of properties to be retrieved eg. 'cn,50;ADsPath'
--                   - properties are separated by a semicolon and
--                   - and comma can be used to specify return length of the property
-- @searchScope - scope to be used for searching: {Base,OneLevel,Subtree}
-- @pageSize - specifies the PageSize for paged search - default is 1000
--               it is possible to lower the value if there is a problem 
--             retrieving such amount of records at once.
-- @rowsLimit - represents maximum number of rows returned.
--              NULL or value less than 1 represents unlimited
--==========================================================================
CREATE PROCEDURE usp_QueryAD
    @adRoot nvarchar(255),
    @filter nvarchar(255),
    @propertiesToLoad nvarchar(255),
    @searchScope nvarchar(8),
    @pageSize int = 1000,
    @rowsLimit int = 0
AS
EXTERNAL NAME [ADServices].[ActiveDirectory].[QueryAD]
GO
--==========================================================================
-- @userName - username to be used for authentificaiton to AD
-- @password - password to be used for authentification to AD
-- @adRoot - root of searching eg. 'LDAP://OU=Sales,DC=Fabricam,DC=com'
-- @filter -  filter to be used for searching eg. '(&(objectCategory=group)'
-- @propertiesToLoad - list of properties to be retrieved eg. 'cn,50;ADsPath'
--                   - properties are separated by a semicolon and
--                   - and comma can be used to specify return length of the property
-- @searchScope - scope to be used for searching: {Base,OneLevel,Subtree}
-- @pageSize - specifies the PageSize for paged search - default is 1000
--               it is possible to lower the value if there is a problem 
--             retrieving such amount of records at once.
-- @rowsLimit - represents maximum number of rows returned.
--              NULL or value less than 1 represents unlimited
--==========================================================================
CREATE PROCEDURE usp_QueryADUname
    @userName nvarchar(255),
    @password nvarchar(255),
    @adRoot nvarchar(255),
    @filter nvarchar(255),
    @propertiesToLoad nvarchar(255),
    @searchScope nvarchar(8),
    @pageSize int = 1000,
    @rowsLimit int = 0
AS
EXTERNAL NAME [ADServices].[ActiveDirectory].[QueryADUName]
GO
--==========================================================================
-- @userName - username to be used for authentificaiton to AD
-- @password - password to be used for authentification to AD
-- @authType - Authentification Type to be used for AD Authentification
--             {None,Secure,Encryption,SecureSocketsLayer,ReadonlyServer,Anonymous,
--             FastBind,Signing,Sealing,Delegation,ServerBind}
-- @adRoot - root of searching eg. 'LDAP://OU=Sales,DC=Fabricam,DC=com'
-- @filter -  filter to be used for searching eg. '(&(objectCategory=group)'
-- @propertiesToLoad - list of properties to be retrieved eg. 'cn,50;ADsPath'
--                   - properties are separated by a semicolon and
--                   - and comma can be used to specify return length of the property
-- @searchScope - scope to be used for searching: {Base,OneLevel,Subtree}
-- @pageSize - specifies the PageSize for paged search - default is 1000
--               it is possible to lower the value if there is a problem 
--             retrieving such amount of records at once.
-- @rowsLimit - represents maximum number of rows returned.
--              NULL or value less than 1 represents unlimited
--==========================================================================
CREATE PROCEDURE usp_QueryADAuth
    @userName nvarchar(255),
    @password nvarchar(255),
    @authType nvarchar(20),
    @adRoot nvarchar(255),
    @filter nvarchar(255),
    @propertiesToLoad nvarchar(255),
    @searchScope nvarchar(8),
    @pageSize int = 1000,
    @rowsLimit int = 0
AS
EXTERNAL NAME [ADServices].[ActiveDirectory].[QueryADAuth]
GO

After successful creation of the stored procedures we can start querying the AD: The filter parameter is a LDAP filter according the Search Filter Syntax.

--Query AD Groups
--Result set will contain cn - common name, ADsPath
EXECUTE [usp_QueryAD] 
   @adRoot = 'LDAP://DC=Fabricam,DC=com'
  ,@filter = '(&(objectCategory=group))'
  ,@propertiesToLoad = 'cn,50;ADsPath'
  ,@searchScope = 'subtree'
GO

--Query users of Sales organization unit
--Result set will containt sn - surname, cn - common name and ADsPath
EXECUTE [usp_QueryAD] 
   @adRoot = 'LDAP://OU=Sales,DC=Fabricam,DC=com'
  ,@filter = '(&(objectCategory=user)(objectClass=user))'
  ,@propertiesToLoad = 'sn,cn,ADsPath'
  ,@searchScope = 'subtree'
GO

--Query al users belonging to Group "Sales Representatives" and providing user name and password 
--for AD Authetication including authetication type
--Result set will containt sn - surname, cn - common name and ADsPath
EXECUTE [TestDB3].[dbo].[usp_QueryADAuth] 
   @userName = 'FabricamADUserName'
  ,@password = 'ADuserNamePassword'
  ,@authType = 'secure'
  ,@adRoot = 'LDAP://DC=Fabricam,DC=com'
  ,@filter = '(&(objectCategory=user)(objectClass=user)(memberOf=CN=Sales Representatives,OU=Security Groups,OU=Groups,DC=Fabricam,DC=com))'
  ,@propertiesToLoad = 'sn,100;cn,200;ADsPath,1000'
  ,@searchScope = 'subtree'
GO

As we use a stored procedure for querying AD, then we cannot work directly with the result further (OK.. On Denali it will be possible thanks to the EXECUTE WITH RESULT SETS. On SQL Server 2005 and 2008 we could store the results e.g. to table variable to temp table and then work with the results as normally. From the CLR code we can see, that the result set contains all the AD properties we have passed as parameter and the order is exactly the same as in the input parameter. The data type of each returned column is nvarchar with length which was specified in the properties list. If no length was specified or the length was less than one or grater than 4000 then the return type is nvarchar(4000).

--Query All users from AD, store them in a table variable and then
--select all users with their cn (common name) starting by "C"
DECLARE @result TABLE (
    sn nvarchar(100),
    cn nvarchar(200),
    ADsPath nvarchar(1000)
)

INSERT INTO @result
EXECUTE [TestDB3].[dbo].[usp_QueryADAuth] 
   @userName = 'FabricamADUserName'
  ,@password = 'ADuserNamePassword'
  ,@authType = 'secure'
  ,@adRoot = 'LDAP://DC=Fabricam,DC=com'
  ,@filter = '(&(objectCategory=user)(objectClass=user))'
  ,@propertiesToLoad = 'sn,100;cn,200;ADsPath,1000'
  ,@searchScope = 'subtree'

SELECT 
*
FROM @result
WHERE cn LIKE 'C%'

From the examples above we can see, that once we create CLR stored procedures for querying the AD, the queries to AD are quite easy. Comparing the the T-SQL and Linked Server or OPENROWSET solution we have much greater possibilities and what is most important, we are not limited to 1000 results from our query, so we can easily query all the object in AD.

If you are interested, you can download sample ADServices Visual C# 2010 Express project. It contains all the latest updates I have mentioned.

Here you can download an updated sample PPSqlDirectoryServices Visual C# 2010 project with all the lasted updates implemented.

Reading SharePoint List and Survey data using Integration Services package

2011-04-08 Pavel Pawlowski CLR, SQL Server, SSIS CLR, SharePoint, SQL Server, SSIS

Sometimes you can come to situation when you need to read data from SharePoint and store them in database on SQL Server.

If you need to read a SharePoint List, then you can use SharePoint List Source and Destination components from CodePlex. There is a quite good presentation how to use it on MSDN site Extracting and Loading SharePoint Data in SQL Server Integration Services

These components work very well for the SharePoint List. Unfortunately you will have no success using them for reading data from the SharePoint Survey, although the structure of the data are nearly the same as for list. Simply these components doesn’t support Survey data. Also it can happen, that you cannot install any third party component you your server and therefore you cannot use the SharePoint List Source and Destination components.

In that case you can use a script which will extract the XML data out of the SharePoint and you have two possibilities how to handle the XML.

You can save the XML file to some temporary location and then process the XML file using the XML Source in the Data Flow task to read data from it.
You can process the XML inside the Script Component in the Data Flow task and direct records to corresponding output.

To retrieve the data you can use a below class snippet (C#), which you put into your Script Component or Scrip Task.

internal class SharePointDownloader
{
    public static XmlDocument GetSharePointListXml(string listUrl)
    {
        ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(CertificateOverride.RemoteCertificateValidationCallback);
        WebResponse response;
        HttpWebRequest request = (HttpWebRequest)System.Net.HttpWebRequest.Create(listUrl);
        CookieContainer cookieContainer = new CookieContainer();

        request.CookieContainer = cookieContainer;
        request.Credentials = System.Net.CredentialCache.DefaultNetworkCredentials;

        response = request.GetResponse();
        string html;
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            html = sr.ReadToEnd();
            sr.Close();
        }

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(html);

        XmlDocument destXml = new XmlDocument();
        XmlDeclaration dec = destXml.CreateXmlDeclaration("1.0", null, null);
        destXml.AppendChild(dec);
        XmlNode data = destXml.CreateElement("Data");
        destXml.AppendChild(data);

        foreach (XmlNode node in doc.GetElementsByTagName("z:row"))
        {
            XmlNode row = destXml.CreateElement("row");
            foreach (XmlAttribute attr in node.Attributes)
            {
                XmlAttribute da = destXml.CreateAttribute(attr.Name);
                da.Value = attr.Value;
                row.Attributes.Append(da);
            }
            data.AppendChild(row);
        }

        return destXml;
    }

    internal class CertificateOverride
    {
        public static bool RemoteCertificateValidationCallback(object sender, X509Certificate certificate, X509Chain chaing, SslPolicyErrors sslPolicyErrors)
        {
            return true;
        }
    }
}

As mentioned above put this class as part of the Script Component or Script Task and use the static method SharePointDownloader.GetSharePointListXml method to retrieve the XML data.

The function takes as argument an URL which points to the list/survey you need to download. You can retrieve this link from the .igy file which you get, when you select Export To Spreadsheet action in the SharePoint. Instead opening the file in excel you choose to save it and retrieve the URL from this file.

It is necessary to mention the assignation of the ServerCertificateValidationCallback property of the ServicePointManager class to a newly created internal class, which simply returns true. This is necessary to avoid problems with certificates.

The code snippet also uses the DefaultNetworkCredentials which represents credentials of the account under which the SSIS package will be executed. You can eventually provide other credentials if necessary.

One very important thing related to this snipped is, that in case of downloading Survey data, the account under which the code will be executed needs admin rights to the Survey in SharePoint, otherwise it will have no access to the complete list of responses in the survey.

So finally a complete sample of a Script Task could look similar to below code:

[System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
{

    #region VSTA generated code
    enum ScriptResults
    {
        Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
        Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    };

    #endregion

    public void Main()
    {

        string url = "source data URL retrieved from the .igy file of the list/survey";
        string destFile = "c:tempfile.xml";
        try
        {
            XmlDocument data = SharePointDownloader.GetSharePointListXml(url);
            data.Save(destFile);
        }
        catch
        {
            Dts.TaskResult = (int)ScriptResults.Failure;
            return;
        }
        Dts.TaskResult = (int)ScriptResults.Success;
    }
}

internal class SharePointDownloader
{
    public static XmlDocument GetSharePointListXml(string listUrl)
    {
        ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(CertificateOverride.RemoteCertificateValidationCallback);
        WebResponse response;
        HttpWebRequest request = (HttpWebRequest)System.Net.HttpWebRequest.Create(listUrl);
        CookieContainer cookieContainer = new CookieContainer();

        request.CookieContainer = cookieContainer;
        request.Credentials = System.Net.CredentialCache.DefaultNetworkCredentials;

        response = request.GetResponse();
        string html;
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            html = sr.ReadToEnd();
            sr.Close();
        }

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(html);

        XmlDocument destXml = new XmlDocument();
        XmlDeclaration dec = destXml.CreateXmlDeclaration("1.0", null, null);
        destXml.AppendChild(dec);
        XmlNode data = destXml.CreateElement("Data");
        destXml.AppendChild(data);

        foreach (XmlNode node in doc.GetElementsByTagName("z:row"))
        {
            XmlNode row = destXml.CreateElement("row");
            foreach (XmlAttribute attr in node.Attributes)
            {
                XmlAttribute da = destXml.CreateAttribute(attr.Name);
                da.Value = attr.Value;
                row.Attributes.Append(da);
            }
            data.AppendChild(row);
        }

        return destXml;
    }

    internal class CertificateOverride
    {
        public static bool RemoteCertificateValidationCallback(object sender, X509Certificate certificate, X509Chain chaing, SslPolicyErrors sslPolicyErrors)
        {
            return true;
        }
    }
}

Once you save the data to a temporary destination, you can connect to the xml file using the XML Source in the Data Flow Task. What you will need is to generate the XSD schema from the XML source and keep that XSD schema for future use as in case of any other XML file which needs to be imported by SSIS.

If you would like to process the XML in the Script component, again you simply put the code inside the Script component, retrieve the XML and then go through the XML programmatically and direct the data to appropriate output of the script component.

Of course you can make some fine tuning for the script component to avoid double processing of the file use the core of the GetSharePointListXML and redirect the data directly to and output of the script component when processing the nodes in that part.

Fuzzy strings matching using Levenshtein algorithm on SQL Server (T-SQL vs CLR)

2010-12-28 Pavel Pawlowski CLR, SQL Server, T-SQL algorithm, CLR, Comparison, Levenshtein, SQL Server, T-SQL

Levenshtein algorithm is one of possible fuzzy strings matching algorithm. Levenshtein algorithm calculates Levenshtein distance which is a metric for measuring a difference between two strings. The Levenshtein distance is also called an edit distance and it defines minimum single character edits (insert/updates/deletes) needed to transform one string to another. Details on the algorithm itself can be found on Wikipedia.
When you need to use it in queries, functions or stored procedures you have two possibilities – T-SQL implementation and CLR implementation.
I will show both solutions here and also compare the speed of both solutions.

T-SQL implementation of Levenshtein algorithm

For T-SQL I will took the one I have found on the SqlTeam.com forums a which was originally developed by Joseph Gama as mentioned in the SqlTeam post.

CREATE FUNCTION [dbo].[edit_distance](
  @s1 nvarchar(3999),
  @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  DECLARE @s1_len int, @s2_len int
  DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
  DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
  SELECT
    @s1_len = LEN(@s1),
    @s2_len = LEN(@s2),
    @cv1 = 0x0000,
    @j = 1, @i = 1, @c = 0

  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len

  BEGIN
    SELECT
      @s1_char = SUBSTRING(@s1, @i, 1),
      @c = @i,
      @cv0 = CAST(@i AS binary(2)),
      @j = 1

    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT
      @cv1 = @cv0,
      @i = @i + 1
  END

  RETURN @c
END

CLR implementation of Levenshtein algorithm

public class FuzzyStrings{
  /// <summary>
  /// Calculates the Levenshtein Distance between two strings.
  /// It is minimum of single character insert/delete/update operations needed to transfrom
  /// first string into the second string
  /// </summary>
  /// <param name="firstString">First string to calculate the distance</param>
  /// <param name="secondString">Second string to calculate the distance</param>
  /// <param name="ignoreCase">Specifies whether to ignore case in comparison</param>
  /// <returns>int represending the Levenshtein Distance</returns>
  public static int LevenshteinDistance(SqlString firstString, SqlString secondString, SqlBoolean ignoreCase)
  {
    string strF = ignoreCase ? firstString.Value.ToLower() : firstString.Value;
    string strS = ignoreCase ? secondString.Value.ToLower() : secondString.Value;
    int lenF = strF.Length;
    int lenS = strS.Length;
    int[,] d = new int[lenF + 1, lenS + 1];

    for (int i = 0; i <= lenF; i++)
      d[i, 0] = i;
    for (int j = 0; j <= lenS; j++)
      d[0, j] = j;

    for (int j = 1; j <= lenS; j++)
    {
      for (int i = 1; i <= lenF; i++)
      {
        if (strF[i - 1] == strS[j - 1])
          d[i, j] = d[i - 1, j - 1];
        else
          d[i, j] = Math.Min(Math.Min(
            d[i - 1, j] + 1,        // a deletion
            d[i, j - 1] + 1),       //an Insertion
            d[i - 1, j - 1] + 1);   // a substitution
      }
    }

    return d[lenF, lenS];
  }
}

You need to compile the code using e.g. C# Express into a an assembly and create the assembly in DB. In my Case the assembly is named [SQLCLR].

CREATE FUNCTION [dbo].[fn_LevenshteinDistance](
  @firstString [nvarchar](4000),
  @secondString [nvarchar](4000),
  @ingoreCase [bit] = 1
)
RETURNS [int]
WITH EXECUTE AS CALLER
AS EXTERNAL NAME [CLRSQL].[FuzzyStrings].[LevenshteinDistance]
GO

Testing of the functions

Once we have properly created both functions we can start testing it. Here is a script for test of both functions on several strings.

SELECT
  dbo.edit_distance('Sunday', 'Monday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'Monday', 1) AS CLRDistance

UNION ALL 

SELECT
  dbo.edit_distance('Sunday', 'Sunday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'Sunday', 0) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('Sunday', 'sunday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'sunday', 0) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('Saturday', 'Monday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Saturday', 'Monday', 1) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('This is a first string to Compare', 'This is a second string to Compare') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('This is a first string to Compare', 'This is a second string to Compare', 1) AS CLRDistance

And here are the results of the tests:

TSQLDistance CLRDistance
------------ -----------
2            2
0            0
0            1
5            5
6            6

As we can see, the functions return the same results except the second select. As we have selected not to ignore the case in the CLR version. For T-SQL strings are identical as the function uses a default collation which I have Case Insensitive. If I had a default collation Case Sensitive, then the results will be the same. It could be also possible to modify the T-SQL function to accept parameter for Case Sensitive/Insensitive comparison and then use different collations for that, but it’s not what we want to do here.

Speed comparison

As we saw in previous paragraphs here, both T-SQL and CLR version of the algorithm woks correctly. Now take a look on the calculation speed of the Levenshtein distance by both version.

For the test we can use a simple script, which will calculate the Levenshtein distance in cycle 10 000 times

DECLARE
   @TSQLStartTime datetime,
  @TSQLEndTime datetime,
  @CLRStartTime datetime,
  @CLREndTime datetime,
  @distance int,
  @i int

SELECT
  @i = 0,
  @TSQLStartTime = GETDATE();

WHILE (@i < 10000)
BEGIN
  SELECT
    @distance = dbo.edit_distance('This is a first string to Compare', 'This is a second string to compare'),
    @i = @i + 1
END

SELECT
  @TSQLEndTime = GETDATE(),
  @i = 0,
  @CLRStartTime = GETDATE()

WHILE (@i < 10000)
BEGIN
  SELECT
    @distance = [ClrSafe].fn_LevenshteinDistance('This is a first string to Compare', 'This is a second string to compare', 1),
    @i = @i + 1
END

SELECT @CLREndTime = GETDATE()

SELECT
  DATEDIFF(millisecond, @TSQLStartTime, @TSQLEndTime) AS TSQLDuration,
  DATEDIFF(millisecond, @CLRStartTime, @CLREndTime) AS CLRDuration

Here are the results

TSQLDuration CLRDuration
------------ -----------
22993        763

As we can see the 10 000 times calculation using the T-SQL version took 22993 milliseconds which is in average circa 2.3 millisecond for calculating the distance for the strings in our test query.

On the other side the 10 000 times calculation using CLR took only 763 milliseconds which is in average circa 0.08 milliseconds for calculating the distance for the same strings as in T-SQL version.

Conclusion

From the results we can see that the CLR is about 30 times faster on the same machine than the T-SQL version of the same algorithm. Even the T-SQL version took only 2.3 milliseconds per calculation of sample texts and it’s quite good to use in normal usage, the use or CLR can enormously decrease the processing time when processing higher volume of records.

Also it is another example where CLR beats the T-SQL solution and where it has sense to use the CLR instead of pure T-SQL.

Fastest CSV strings splitting using CLR (T-SQL vs. CLR revisited)

2010-10-06 Pavel Pawlowski CLR, T-SQL CLR, RegEx, SQL Server, String splitting, T-SQL

In one my previous blog post SQL Server – String splitting (T-SQL vs. CLR) I was comparing speed of T-SQL solution for string splitting vs. CLR RegEx solution. Although the CLR RegEx is fast enough, it isn’t the fastest solution for simple CSV string splitting. Also I will again compare it to the T-SQL solution.

In the mentioned post the T-SQL solution wasn’t usable for larger amount of CSV values, but after some investigations by Jeff Moden in post http://www.sqlservercentral.com/Forums/FindPost997236.aspx found a mistake we both made in the T-SQL Testing, and therefore I will post here also updated comparison to the T-SQL version

Fastest CLR version

Probably fastest CLR version for splitting sting is a CLR table-valued function which processes the whole string on character level and on a delimiter sends the results.

Here is one possible solution.

public class StringSplit
{
    private struct StrRow
    {
        public StrRow(int rowId, SqlChars value)
        {
            RowId = rowId;
            Value = value;
        }

        public int RowId;
        public SqlChars Value;

    }

    [SqlFunction(FillRowMethodName = "FillSplitString3")]
    public static IEnumerable SplitString3(SqlString sourceString, string delimiter, int maxLen)
    {
        char[] buffer = new char[maxLen];
        char delim = delimiter[0];
        int rowNumber = 0;
        int chars = 0;
        char[] finalString;

        foreach (char chr in sourceString.Value)
        {
            if (chr == delim)
            {
                finalString = new char[chars];
                Array.Copy(buffer, finalString, chars);
                yield return new StrRow(++rowNumber, new SqlChars(finalString));
                chars = 0;
            }
            else
            {
                buffer[chars++] = chr;
            }
        }
        if (chars > 0)
        {
            finalString = new char[chars];
            Array.Copy(buffer, finalString, chars);
            yield return new StrRow(++rowNumber, new SqlChars(finalString));
        }

    }

    [SqlFunction(FillRowMethodName = "FillSplitString3")]
    public static IEnumerable SplitString4(SqlString sourceString, string delimiter)
    {
        StringBuilder sb = new StringBuilder();
        char delim = delimiter[0];
        int rowNumber = 0;
        foreach (char chr in sourceString.Value)
        {
            if (chr == delim)
            {
                yield return new StrRow(++rowNumber, new SqlChars(sb.ToString()));
                sb = new StringBuilder(sb.Capacity);
            }
            else
            {
                sb.Append(chr);
            }
        }
        if (sb.Length > 0)
        {
            yield return new StrRow(++rowNumber, new SqlChars(sb.ToString()));
        }

    }

    public static void FillSplitString3(object obj, out int rowId, out SqlChars value)
    {
        StrRow r = (StrRow)obj;
        rowId = r.RowId;
        value = r.Value;
    }
}

CREATE FUNCTION dbo.fn_SplitString3(
  @sourceString nvarchar(max),
  @delimiter nchar(1),
  @maxLen int
)
RETURNS  TABLE (
    RowID int NULL,
    Value nvarchar(10) NULL
) WITH EXECUTE AS CALLER
AS
EXTERNAL NAME SQLRegEx.StringSplit.SplitString3
GO

This function takes three parameters. First the source string to be split, delimiter and maxLen, which is maximum length for an item in the CSV List. It is used to allocate buffer. And e.g.. for integer values it will be 10 as positive integer will have maximum of 10 digits. It is possible to write this function also without this parameter, but I’ve added it because of speed, as it doesn’t require buffer reallocations.

I will compare the speed also to the CLR RegEx version. I will use the function mentioned in my previous post.

For CLR RegEx we will use a simple Regular expression ”d+” as it is enough for the integer values delimited by commas.

As T-SQL candidate for speed comparison I will use the latest optimized version of Tally table splitting by Jeff Moden.

CREATE FUNCTION dbo.Split8KTallyM (
    @Parameter VARCHAR(8000),
    @Delimiter VARCHAR(1)
)
RETURNS @Result TABLE (ItemNumber INT, ItemValue INT) AS
  BEGIN
 INSERT INTO @Result
        (ItemNumber, ItemValue)
 SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
        SUBSTRING(@Parameter,N,CHARINDEX(@Delimiter,@Parameter+@Delimiter,N)-N) AS ItemValue
   FROM dbo.Tally
  WHERE N BETWEEN 1 AND LEN(@Parameter)+1
    AND SUBSTRING(@Delimiter+@Parameter,N,1) = @Delimiter; --Notice how we find the comma
 RETURN
    END;
GO

Test data preparation

I will use as test data the same tables as in previous tests. We will use table with 10 000 rows and each will be with different length of CSV string (16 items, 100 items and 1333 items). The table definition will be only modified and the string will not be stored as nvarchar(max) but as varchar(max). The nvarchar in previous test totally degraded the T-SQL solution so it was not usable for 1333 item in SCV string.

SELECT TOP 11000
    IDENTITY(INT, 1, 1) AS N
INTO dbo.Tally
FROM sys.all_objects o1, sys.all_objects
GO

--Add Clustered Index on Tally table
ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
GO

--Create and populate CsvTest table (doesn't matter whether the table has Clustered index or it is simply heap)
SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (16) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up 121 million.
GO

SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (100) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest2
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up 121 million.
GO

SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (1333) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest3
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up 121 million.
GO

Speed comparison

Here is a script I will use to compare the speed:

--================= 16 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, 'd+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO
--================= 100 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest2 D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest2 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, 'd+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest2 D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO
--================= 1333 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest3 D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest3 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, 'd+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest3 D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO

And here are the results from profiler:

Results of comparison and conclusion

As we can see in the output from profiles, the new fn_SplitString3 function is unbeatable in all scenarios. While the T-SQL took 3.5 seconds for 16 items, the new CLR split function takes only 253 milliseconds. As mentioned in previous post, the CLR RegEx benefits at higher items count over 100. And in higher counts beats the T-SQL Solutions. The new fn_SplitString even on 1333 items count took only 8.2 sec.

Calculating running total for last X rows – UPDATED

2010-10-02 Pavel Pawlowski CLR, T-SQL CLR, Running total, SQL Server, T-SQL

This is another blog post about running totals and again I will use CLR and demonstrate the power of CLR in such situations. Again this post is inspired by an article on SQL Server Central – Calculate the Running Total for the last five Transactions by Divya Agrawal.

This is a kind of running totals when we want to sum only a specific number of last rows.

The CLR function will be nothing than a slightly modification of the running totals function from my previous blog post: SQL Sever and fastest running totals using CLR.

I’ve updated this blog post based on the update of mentioned previous post, after communication with Paul White and added the security check into the CLR function to detect processing of rows out of expected order. Please read the previous post for more details.

Test data preparation

I will use sample data from the article, so it is possible to compare the solutions and we will be able to see the difference.

CREATE TABLE Accounts
(
    ID int IDENTITY(1,1),
    TransactionDate datetime,
    Balance float
)
GO

INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/1/2000',100)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/2/2000',101)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/3/2000',102)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/4/2000',103)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/5/2000',104)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/6/2000',105)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/7/2000',106)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/8/2000',107)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/9/2000',108)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/10/2000',109)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/11/2000',200)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/12/2000',201)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/13/2000',202)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/14/2000',203)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/15/2000',204)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/16/2000',205)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/17/2000',206)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/18/2000',207)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/19/2000',208)
INSERT INTO Accounts(TransactionDate,Balance) VALUES ('1/20/2000',209)
GO

CLR Solution

As I mentioned above the CLR solution is only a modification of function from previous blog post. We only add a queue of last calculated running totals.

The image above displays situation when we want to make running total for last 3 rows. Our solution will calculate the running total continuously as it is done in column B, but when the requested count of 3 rows is met, it will subtract the running total 3 rows back as can be seen on the image in column B. For this we need in our function a queue of the last X running total values (In this example 3). This class also implements the security check introduced in my previously updated article related to Running Totals. The security check consist in providing the correct row numbers in the sequence in which the records should be processed and in case the records are processed out of expected order, an exception is fired.

using System;
using Microsoft.SqlServer.Server;
using System.Runtime.Remoting.Messaging;
using System.Data.SqlTypes;
using System.Collections.Generic;

/// <summary>
/// Class contains CLR scalar functions for calculation of running totals
/// </summary>
public class RunningTotalsQueue
{
    /// <summary>
    /// Storage Structure for holding actual Total and row number for security check.
    /// </summary>
    /// <typeparam name="T">Totals Data Type</typeparam>
    private class RtStorage<T> where T : struct
    {
        public T Total;
        public int RowNo;
        public Queue<T> Queue;
        public RtStorage(int queueLength)
        {
            Queue = new Queue<T>(queueLength);
        }
    }

    /// <summary>
    /// Calculates a running totals on Int (Int32) data type
    /// </summary>
    /// <param name="val">Value of current row</param>
    /// <param name="id">ID of the function in single query</param>
    /// <param name="queueLength">Length of the queue. 0 for classical running totals</param>
    /// <param name="rowNo">Specifies expected rowNo. It is for security check to ensure correctness of running totals</param>
    /// <param name="nullValue">Value to be used for NULL values</param>
    /// <param name="nullForLessRows">Specifies whether return NULL if less values than queue are summed</param>
    /// <returns>SqlInt64 representing running total</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlInt64 RunningTotalBigIntQueue(SqlInt64 val, SqlByte id, int queueLength, int rowNo, SqlInt64 nullValue, bool nullForLessRows)
    {
        string dataName = string.Format("MultiSqlRtQueue_{0}", id.IsNull ? 0 : id.Value);

        object lastSum = CallContext.GetData(dataName);

        RtStorage<SqlInt64> storage;
        if (lastSum != null)
            storage = (RtStorage<SqlInt64>)lastSum;
        else
            CallContext.SetData(dataName, storage = new RtStorage<SqlInt64>(queueLength));

        storage.RowNo++;

        if (storage.RowNo != rowNo)
            throw new System.InvalidOperationException(string.Format("Rows were processed out of expected order. Expected RowNo: {0}, received RowNo: {1}", storage.RowNo, rowNo));

        if (!val.IsNull)
            storage.Total = storage.Total.IsNull ? val : storage.Total + val;
        else
            storage.Total = storage.Total.IsNull ? nullValue : (nullValue.IsNull ? storage.Total : storage.Total + nullValue);

        var currentTotal = storage.Total;
        if (queueLength > 0 && storage.Queue.Count == queueLength)
        {
            var lastQueue = storage.Queue.Dequeue();
            currentTotal -= lastQueue.IsNull ? 0 : lastQueue;
        }
        else if (storage.Queue.Count < queueLength && nullForLessRows)
            currentTotal = SqlInt64.Null;

        if (queueLength > 0)
            storage.Queue.Enqueue(storage.Total);

        return currentTotal;
    }

}

On first call the function allocates a queue of requested queue size inside a private storage class and stores it using the CallContext for use by future calls. The function also allows returning NULL when the count of rows processed is lower than the number of rows requested to be calculated in the running total and also allows calculation of classical running totals (not queued) in case the queue length is equal to 0.

CREATE FUNCTION [dbo].[fn_RunningTotalBigIntQueue](
	@val [bigint],                  --value to be added to running total (a fiedl in the query)
	@id [tinyint],                  --id of the running total within a single query
	@queueLength [int],             --lenght of the queue
	@rowNo [int],                   --RowNumber of processed records. This is compared to espected rowNo and in case out of synchronization an exceiption is fired
	@nullValue [bigint] = NULL,     --representation of the NULL value when adding to running totals
	@nullForLesRows [bit] = 0       --specifies whether return NULL if less records than queueLenght were processed
)
RETURNS [bigint]
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [SqlClrTotals].[RunningTotalsQueue].[RunningTotalBigIntQueue]
GO

Tests

Once we have compiled the sources, created assembly and function in DB, we can make a tests. First we will test the T-SQL solution from the article mentioned at the beginning of this blog post and then the presented CLR solution.

T-SQL solution

SET STATISTICS IO ON
GO
SELECT
    Acc.ID,CONVERT(varchar(10),TransactionDate,101) AS TransactionDate ,
    Balance,
    isnull(RunningTotal,'') AS RunningTotal
FROM Accounts Acc
LEFT OUTER JOIN (SELECT ID,sum(Balance) AS RunningTotal
        FROM (SELECT A.ID AS ID,B.ID AS BID, B.Balance
            FROM Accounts A
                cross JOIN Accounts B
            WHERE B.ID BETWEEN A.ID-4
            AND A.ID AND A.ID>4
            ) T
        GROUP BY ID ) Bal
    ON Acc.ID=Bal.ID
GO
SET STATISTICS IO OFF
GO

Results:

ID          TransactionDate Balance                RunningTotal
----------- --------------- ---------------------- ----------------------
1           01/01/2000      100                    0
2           01/02/2000      101                    0
3           01/03/2000      102                    0
4           01/04/2000      103                    0
5           01/05/2000      104                    510
6           01/06/2000      105                    515
7           01/07/2000      106                    520
8           01/08/2000      107                    525
9           01/09/2000      108                    530
10          01/10/2000      109                    535
11          01/11/2000      200                    630
12          01/12/2000      201                    725
13          01/13/2000      202                    820
14          01/14/2000      203                    915
15          01/15/2000      204                    1010
16          01/16/2000      205                    1015
17          01/17/2000      206                    1020
18          01/18/2000      207                    1025
19          01/19/2000      208                    1030
20          01/20/2000      209                    1035 

(20 row(s) affected) 

Table 'Accounts'. Scan count 37, logical reads 37, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

We cab see, that for only a few records and the T-SQL solution produced 37 logical reads and the query plan is quite complex. And there is a lot of table scans.

If we add a clustered primary key on the ID

ALTER TABLE dbo.Accounts ADD CONSTRAINT PK_Accounts PRIMARY KEY CLUSTERED(ID)

Then the plan will be better as we will replace the three table scans in the query plan by one clustered index scan and two clustered index seeks. But even now the plan is quite complex.

CLR Solution test

SET STATISTICS IO ON
GO
SELECT
    Acc.ID
    ,CONVERT(varchar(10),TransactionDate,101) AS TransactionDate
    ,Balance
    ,ISNULL(dbo.fn_RunningTotalBigIntQueue(Balance, 0, 5, ROW_NUMBER() OVER(ORDER BY ID), NULL, 1), 0) AS RunningTotal
 FROM Accounts Acc
 ORDER BY ID
 OPTION(MAXDOP 1)
GO
SET STATISTICS IO OFF
GO

Here we have results:

ID          TransactionDate Balance                RunningTotal
----------- --------------- ---------------------- ----------------------
1           01/01/2000      100                    0
2           01/02/2000      101                    0
3           01/03/2000      102                    0
4           01/04/2000      103                    0
5           01/05/2000      104                    510
6           01/06/2000      105                    515
7           01/07/2000      106                    520
8           01/08/2000      107                    525
9           01/09/2000      108                    530
10          01/10/2000      109                    535
11          01/11/2000      200                    630
12          01/12/2000      201                    725
13          01/13/2000      202                    820
14          01/14/2000      203                    915
15          01/15/2000      204                    1010
16          01/16/2000      205                    1015
17          01/17/2000      206                    1020
18          01/18/2000      207                    1025
19          01/19/2000      208                    1030
20          01/20/2000      209                    1035 

(20 row(s) affected) 

Table 'Accounts'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Here we can see, that the CLR solution produced exactly the same output as the T-SQL solution. What more, we have only 1 logical read as the table is small and we have also received a very simple execution plan with single Table Scan and Sort. The segment and Sequence Project(Compute Scalar) operators are the ROW_NUMBER() function for security check. The compute Scalar is our function computing the running totals.

In case of the Clustered Primary Key the plan will contain one Clustered Index scan without sort as the Clustered Key is already in our expected order.

Conclusion

As we can see from the simple test here, the CLR solution is unbeatable for such purposes. Even if we use other T-SQL solution like “quirky updates”, even then the CLR will have less reads and should be quicker.

When increasing the number of records and increasing the last X number, then with higher number count, the T-SQL solution will kill even high end servers. On the other side you will see nearly no difference in speed when using the CLR solution. As it only needs to allocate one queue of last X elements. Even if we want to count last 1000 000 rows, then it only needs to allocate such queue and in case of bigint it will 8 MB + some overhead of the queue.

Of course the CLR solution has the issues with processing rows in right order, but for that purposes there is the security check implemented so in case rows are processed in other than expected order an exception is thrown. But in most cases the scenario is very simple and you need to calculate the total from a simple table and in such cases this solution work without bigger problems. In case of complex queries with JOINS or other windowing functions, records can be processed into a temp table and then the running totals calculated on the temporary table.

Pavel's SQL Pages

SQL Server & other stuff related

CLR

Exporting CLR Assemblies from SQL Server back to .dll files

Removing Accent (diacritics) using CLR

Second Update to Querying Active Directory on SQL Server using CLR

SQL Server Regular Expressions – Replacements

RegExReplace

RegExMatchesReplace

Testing the Replace method

Replace year in all Dates with year 2011

Remove the values from the Key=Value pairs and leave only the Key values separated by semicolons

Get list of all dates from within a string and replace the year of each date by 2011

Update to Querying Active Directory on SQL Server using CLR

Querying Active Directory on SQL Server using CLR

Reading SharePoint List and Survey data using Integration Services package

Fuzzy strings matching using Levenshtein algorithm on SQL Server (T-SQL vs CLR)

T-SQL implementation of Levenshtein algorithm

CLR implementation of Levenshtein algorithm

Testing of the functions

Speed comparison

Conclusion

Fastest CSV strings splitting using CLR (T-SQL vs. CLR revisited)

Fastest CLR version

Test data preparation

Speed comparison

Results of comparison and conclusion

Calculating running total for last X rows – UPDATED

Test data preparation

CLR Solution

Tests

T-SQL solution

CLR Solution test

Conclusion

Share this:

Share this:

Share this:

RegExReplace

RegExMatchesReplace

Testing the Replace method

Replace year in all Dates with year 2011

Remove the values from the Key=Value pairs and leave only the Key values separated by semicolons

Get list of all dates from within a string and replace the year of each date by 2011

Share this:

Share this:

Share this:

Share this:

T-SQL implementation of Levenshtein algorithm

CLR implementation of Levenshtein algorithm

Testing of the functions

Speed comparison

Conclusion

Share this:

Fastest CLR version

Test data preparation

Speed comparison

Results of comparison and conclusion

Share this:

Test data preparation

CLR Solution

Tests

T-SQL solution

CLR Solution test

Conclusion

Share this: