This blog post was inspired by a question on Ask SQL Server Central.com about finding a substring between two characters.
Sometimes it is necessary to process input strings and retrieve parts of them. For this purpose we can write more complex T-SQL constructs, or we can use the CLR functionality of SQL Server 2005 and SQL Server 2008 and write functions that process regular expressions.
In general, string manipulation (especially on very large string values) performs much better in CLR scalar and table-valued functions than in equivalent pure T-SQL code. With regular expressions you also do not need to write a specialized function for each kind of processing: you simply write a regular expression and pass it to the RegEx function.
For general information about writing CLR functions, read MSDN.
You can find more information on regular expressions at Regular-Expressions.info, and further inspiration in the Regular Expression Library.
Implementation
Below you can find a SQLRegEx class which contains two functions: RegExMatch, a scalar CLR function that returns a particular group from within a particular match, and RegExMatches, a CLR table-valued function that returns all matches and groups.
Below is a code snippet you can use to create an assembly.
using System;
using System.Collections;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public class SQLRegEx
{
    /// <summary>
    /// Private class for passing matches of the RegExMatches to the FillRow method
    /// </summary>
    private class RegExRow
    {
        /// <param name="rowId">ID of the Row</param>
        /// <param name="matchId">ID of the Match</param>
        /// <param name="groupID">ID of the Group within the Match</param>
        /// <param name="value">Value of the particular group</param>
        public RegExRow(int rowId, int matchId, int groupID, string value)
        {
            RowId = rowId;
            MatchId = matchId;
            GroupID = groupID;
            Value = value;
        }

        public int RowId;
        public int MatchId;
        public int GroupID;
        public string Value;
    }

    /// <summary>
    /// Applies Regular Expression on the Source string and returns value of particular group from within a specified match
    /// </summary>
    /// <param name="sourceString">Source string on which the regular expression should be applied</param>
    /// <param name="pattern">Regular Expression pattern</param>
    /// <param name="matchId">ID of the Match to be returned, 1 index-based</param>
    /// <param name="groupId">ID of the group from within a match to return. GroupID 0 returns complete match</param>
    /// <returns>Value of the Group from within a Match</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlChars RegExMatch(string sourceString, string pattern, int matchId, int groupId)
    {
        Match m = null;
        Regex r = new Regex(pattern, RegexOptions.Compiled);

        if (matchId == 1)
        {
            // Only the first match is needed, so avoid building the whole collection
            m = r.Match(sourceString);
        }
        else if (matchId > 1)
        {
            MatchCollection mc = r.Matches(sourceString);
            m = mc != null && mc.Count > matchId - 1 ? mc[matchId - 1] : null;
        }

        return m != null && m.Groups.Count > groupId ? new SqlChars(m.Groups[groupId].Value) : SqlChars.Null;
    }

    /// <summary>
    /// Applies Regular Expression on the Source string and returns all matches and groups
    /// </summary>
    /// <param name="sourceString">Source string on which the regular expression should be applied</param>
    /// <param name="pattern">Regular Expression pattern</param>
    /// <returns>Returns list of RegExRows representing the group value</returns>
    [SqlFunction(FillRowMethodName = "FillRegExRow")]
    public static IEnumerable RegExMatches(string sourceString, string pattern)
    {
        Regex r = new Regex(pattern, RegexOptions.Compiled);
        int rowId = 0;
        int matchId = 0;

        foreach (Match m in r.Matches(sourceString))
        {
            matchId++;
            // Group 0 is the complete match, groups 1..n are the capturing groups
            for (int i = 0; i < m.Groups.Count; i++)
            {
                yield return new RegExRow(++rowId, matchId, i, m.Groups[i].Value);
            }
        }
    }

    /// <summary>
    /// FillRow method to populate the output table
    /// </summary>
    /// <param name="obj">RegExRow passed as object</param>
    /// <param name="rowId">ID of the returned row</param>
    /// <param name="matchId">ID of returned Match</param>
    /// <param name="groupID">ID of group in the Match</param>
    /// <param name="value">Value of the Group</param>
    public static void FillRegExRow(Object obj, out int rowId, out int matchId, out int groupID, out SqlChars value)
    {
        RegExRow r = (RegExRow)obj;
        rowId = r.RowId;
        matchId = r.MatchId;
        groupID = r.GroupID;
        value = new SqlChars(r.Value);
    }
}
RegExMatch function
This is a scalar function which takes four arguments. The first is the source string to be parsed using the regular expression. The second is the regular expression itself. The third is the matchId (counted from 1), which selects the match to be returned when there are several matches. The fourth is the ID of the group to be returned from within that match. When the pattern has no groups, or you simply want the complete match, pass 0.
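The indexing rules described above can be tried outside SQL Server before deploying anything. The following is a minimal sketch, assuming a plain .NET project; `MatchIndexingDemo` and `GetGroup` are hypothetical names used only for illustration. It is a simplified version of the RegExMatch lookup that returns a plain string, and it returns null whenever the requested match or group does not exist, which is not exactly how the CLR function treats every edge case.

```csharp
using System.Text.RegularExpressions;

public static class MatchIndexingDemo
{
    // Hypothetical helper (not part of the article's assembly):
    // matchId is 1-based, groupId 0 means the complete match.
    public static string GetGroup(string source, string pattern, int matchId, int groupId)
    {
        if (matchId < 1 || groupId < 0)
            return null;

        MatchCollection matches = Regex.Matches(source, pattern);
        if (matches.Count < matchId)
            return null; // fewer matches than requested

        Match m = matches[matchId - 1];
        return m.Groups.Count > groupId ? m.Groups[groupId].Value : null;
    }
}
```

For example, `MatchIndexingDemo.GetGroup("This is a sample sentence to Split.", @"\w+", 4, 0)` selects the fourth match and returns "sample", while `MatchIndexingDemo.GetGroup("Key = Value", @"\s*\w+\s*=\s*(.*)", 1, 1)` returns group 1 of the first match, "Value".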
RegExMatches function
This is a CLR table-valued function which returns all matches, including each group in each match. The complete match has groupID equal to 0 in the output. The input parameters are the source string to be parsed and the regular expression pattern.
Once we have a compiled assembly, we can register the assembly and the functions in a database and start using them for various regular expression processing.
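To see how rowId, matchId and groupId relate to each other, the same enumeration can be run in an ordinary .NET project. This is only a sketch under stated assumptions: `MatchRowsDemo` and `Rows` are hypothetical names, and value tuples (C# 7 or later) stand in for the SQL result rows.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchRowsDemo
{
    // Hypothetical helper: yields one (rowId, matchId, groupId, value) tuple
    // per group of every match, in the same order as the table function.
    public static IEnumerable<(int RowId, int MatchId, int GroupId, string Value)> Rows(
        string source, string pattern)
    {
        int rowId = 0;
        int matchId = 0;
        foreach (Match m in Regex.Matches(source, pattern))
        {
            matchId++;
            // Group 0 is the complete match; capturing groups start at 1.
            for (int g = 0; g < m.Groups.Count; g++)
                yield return (++rowId, matchId, g, m.Groups[g].Value);
        }
    }
}
```

Calling `MatchRowsDemo.Rows("a=1;b=2", @"(\w+)=(\w+)")` yields six rows: three for the first match ("a=1", "a", "1") and three for the second ("b=2", "b", "2"), with rowId running from 1 to 6.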
CREATE ASSEMBLY [SQLRegEx]
AUTHORIZATION [dbo]
FROM 'C:\CLR\SQLRegEx.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE FUNCTION [dbo].[fn_RegExMatch](
    @sourceString nvarchar(max), --Source string to be processed by regular expression
    @pattern nvarchar(4000),     --Regular expression (pattern) to be applied on the source string
    @matchId int,                --ID of the match to be returned (if there are more matches, you can return a concrete one). Numbering starts from 1
    @groupID int                 --ID of the group in the match to be returned. GroupID = 0 represents a complete match.
)
RETURNS nvarchar(4000)
WITH EXECUTE AS CALLER
AS EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatch]
GO

CREATE FUNCTION [dbo].[fn_RegExMatches](
    @sourceString nvarchar(max), --Source string to be processed by regular expression
    @pattern nvarchar(4000)      --Regular expression (pattern) to be applied on the source string
)
RETURNS TABLE (
    [rowId] int,           --ID of the row
    [matchId] int,         --ID of particular match (starts from 1)
    [groupId] int,         --ID of particular group in RegEx match (GroupID = 0 represents a complete match)
    [value] nvarchar(4000) --Value of the group
)
WITH EXECUTE AS CALLER
AS EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatches]
GO
RegExMatches examples
For example, split a semicolon-separated list of Key=Value pairs. Both parts of a pair share the same matchId; the key has groupID = 1 and the value has groupID = 2. Without the WHERE condition, the rows with groupID = 0, which contain each complete pair, are returned as well.
Split Key=Value pairs
SELECT *
FROM dbo.fn_RegExMatches('Key1=Val1;Key2=Val2;Key3=Val3;Key4=Val4', '\s*(\w+)\s*=\s*?(.*?)(?=;|\z)')
WHERE groupId > 0

rowId       matchId     groupId     value
----------- ----------- ----------- ------
2           1           1           Key1
3           1           2           Val1
5           2           1           Key2
6           2           2           Val2
8           3           1           Key3
9           3           2           Val3
11          4           1           Key4
12          4           2           Val4
Splitting comma separated list
SELECT *
FROM dbo.fn_RegExMatches('1,2,3,4,5,6,7,8', '(\d+?)(?:,|\z)')
WHERE groupId = 1

--or even this, and it doesn't matter what separator is used
SELECT *
FROM dbo.fn_RegExMatches('1,2,3,4,5;6,7,8', '\w+')
List of words in sentence
SELECT *
FROM dbo.fn_RegExMatches('This is a sample sentence to Split.', '\w+')
RegExMatch examples
Value of the Key=Value pair
SELECT dbo.fn_RegExMatch('Key = Value', '\s*\w+\s*=\s*(.*)', 1, 1)
Fourth word in the string
SELECT dbo.fn_RegExMatch('This is a sample sentence to Split.', '\w+', 4, 0)
Sixth number in list
SELECT dbo.fn_RegExMatch('1,2,3,4,5;6,7,8', '\w+', 6, 0)
Email address validation
--validating email address
DECLARE @emailAddress nvarchar(20)
SET @emailAddress = 'name@domain.com'

SELECT
    CASE
        WHEN @emailAddress = dbo.fn_RegExMatch(@emailAddress, '([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})', 1, 0)
        THEN 1
        ELSE 0
    END AS ValidEmail;

--lists all invalid emails
WITH emails AS (
    SELECT 'name@domain.com' AS email UNION ALL
    SELECT 'name@subdomain.domain.com' UNION ALL
    SELECT 'name@this.email.is.wrong' UNION ALL
    SELECT 'another.wrong@@domain.com'
)
SELECT *
FROM emails
WHERE email <> dbo.fn_RegExMatch(email, '([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})', 1, 0)
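The queries above treat an address as valid when the first match equals the whole input. A related option is to anchor the pattern so that it has to cover the whole input. The sketch below compares the two approaches in plain .NET; `AnchoredValidationDemo` and its members are hypothetical names, and `SimpleEmail` is a deliberately simplified pattern chosen for illustration (it is not the pattern used above and not a complete email validator).

```csharp
using System.Text.RegularExpressions;

public static class AnchoredValidationDemo
{
    // Simplified illustrative pattern (an assumption, not the article's pattern).
    private const string SimpleEmail = @"[A-Za-z0-9_.\-]+@([A-Za-z0-9\-]+\.)+[A-Za-z]{2,4}";

    // Approach used in the T-SQL example: valid when the first match equals the input.
    public static bool IsValidByComparison(string input)
    {
        Match m = Regex.Match(input, SimpleEmail);
        return m.Success && m.Value == input;
    }

    // Alternative: \A and \z force the pattern to span the entire input.
    public static bool IsValidByAnchors(string input)
    {
        return Regex.IsMatch(input, @"\A(?:" + SimpleEmail + @")\z");
    }
}
```

For the four sample addresses from the query above, both methods give the same answers: the first two are accepted and the last two are rejected. With the anchored form, the T-SQL side could test whether fn_RegExMatch returned a match at all instead of comparing strings, but that is a design choice rather than something the original functions require.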
Get day from string date
WITH StringDates AS (
    SELECT '2010/09/3' AS StringDate UNION ALL
    SELECT '2010/07/09' UNION ALL
    SELECT '2009/1/15'
)
SELECT
    StringDate,
    dbo.fn_RegExMatch(StringDate, '(\d{4})/(\d{1,2})/(\d{1,2})', 1, 3) AS DateDay
FROM StringDates
Summary
I hope this article helps you use regular expressions in SQL Server, and that you will be able to simplify and speed up your queries.
You may also be interested in my next post, SQL Server Regular Expressions – Replacements, which describes the possibilities of string replacement using regular expressions.