Removing Accent (diacritics) using CLR

Some times it may happen, that you need to remove accent (diacritics) from string when you are querying data or when you are storing data into particular table.

There are several methods how to achieve this. One of the possible solution is using a CONVERT with COLLATE as Twitted by MVP Mladen Prajdic and mentioned on Luke Jian blog, but this method has an issue I will mention below.

Other solution could be replacement of accented characters with the ones without accent. But this is a very problematic solution as you have to do a lot of replacement (the performance can suffer) and it will be problematic to cover all the possible accent characters and you easily miss one. Similar to this could be creation of a replacement table and implementation of function similar to Splitting function using Tally Table (Jeff Moden). But again you will have to write all possible combination of accented and not accented characters into the replacement table.

So if you do not want to change the collation and would like to avoid possible issues of the CONVERT method and avoid writing replacement tables etc., you can easily use a very simple amd known CLR method. This method is using string normalization and CharUnicodeInfo class from the System.Globalization name space. Based on this you can write a very simple scalar CLR function.

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.Globalization;

public class StringsCommon
{
    /// <summary>
    /// Removes Accent (Diacritics) from string
    /// </summary>
    /// <param name="sourceString">Source string fro wchich accent should be removed</param>
    /// <returns>string without accent</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlChars RemoveAccent(SqlString sourceString)
    {
        if (sourceString.IsNull)
            return SqlChars.Null;

        string normalized = sourceString.Value.Normalize(NormalizationForm.FormD);

        StringBuilder output = new StringBuilder(sourceString.Value.Length);

        foreach (char ch in normalized)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                output.Append(ch);
        }
        return new SqlChars(output.ToString());
    }
}

Once you compile above mentioned function into an .net assembly, you can register the assembly and function in your database.

CREATE ASSEMBLY [PPSqlClrSafe]
AUTHORIZATION [dbo]
FROM 'C:CLRPPSqlClrSafe.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE FUNCTION [ClrSafe].[fn_RemoveAccent](
	@sourceString [nvarchar](max)  --Source string to remove accent
)
RETURNS [nvarchar](4000)
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [PPSqlClrSafe].[StringsCommon].[RemoveAccent]
GO</pre>
</div>
Once the function is registered, you can easily use it to remove accent (diacritics) from whatever string you want
<div style="max-height:400px;overflow:auto;">
<pre class="brush:sql">WITH Data AS (
    SELECT
        'Czech' AS [Language]
        ,N'á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů' AS [AccentChars] UNION ALL
    SELECT
        'Norwegian' AS [Language]
        ,N'ø' AS [AccentChars] UNION ALL
    SELECT
        'Estonian' AS [Language]
        ,N'õ' AS [AccentChars] UNION ALL
    SELECT
        'French' AS [Language]
        ,N'à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü' AS [AccentChars] UNION ALL
    SELECT
        'Croatian, Vietnamese' AS [Language]
        ,N'đ' AS [AccentChars] UNION ALL
    SELECT
        'Latvian' AS [Language]
        ,N'ā, ē, ī, ū, ļ, ķ, ņ, ģ' AS [AccentChars] UNION ALL
    SELECT
        'Hungarian' AS [Language]
        ,N'ö, ü, ő, ű' AS [AccentChars] UNION ALL
    SELECT
        'Polish' AS [Language]
        ,N'ą, ć, ę, ń, ó, ś, ź, ż' AS [AccentChars] UNION ALL
    SELECT
        'Romanian' AS [Language]
        ,N'ă, â, î, ș, ț' AS [AccentChars] UNION ALL
    SELECT
        'Slovak' AS [Language]
        ,N'ô, ť, ľ, ŕ, ĺ, ä' AS [AccentChars] UNION ALL
    SELECT
        'Spanish' AS [Language]
        ,N'ñ' AS [AccentChars] UNION ALL
    SELECT
        'Swedish' AS [Language]
        ,N'å' AS [AccentChars] UNION ALL
    SELECT
        'Turkish' AS [Language]
        ,N'ç, ş, ğ' AS [AccentChars] UNION ALL
    SELECT
        'Greece' AS [Language]
        ,N'Γ, δ, ξ, Φ' AS [AccentChars]
)
SELECT
    [Language]
    ,[AccentChars]
    ,[ClrSafe].fn_RemoveAccent([AccentChars]) AS [RemovedAccent]
FROM Data

Which produces below result:

Language             AccentChars                                 RemovedAccent
-------------------- ------------------------------------------- --------------------------------------------
Czech                á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů a, e, i, o, u, y, c, d, e, n, r, s, t, z, u
Norwegian            ø                                           ø
Estonian             õ                                           o
French               à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü       a, a, c, e, e, e, e, i, i, o, u, u, u
Croatian, Vietnamese đ                                           đ
Latvian              ā, ē, ī, ū, ļ, ķ, ņ, ģ                      a, e, i, u, l, k, n, g
Hungarian            ö, ü, ő, ű                                  o, u, o, u
Polish               ą, ć, ę, ń, ó, ś, ź, ż                      a, c, e, n, o, s, z, z
Romanian             ă, â, î, ș, ț                               a, a, i, s, t
Slovak               ô, ť, ľ, ŕ, ĺ, ä                            o, t, l, r, l, a
Spanish              ñ                                           n
Swedish              å                                           a
Turkish              ç, ş, ğ                                     c, s, g
Greece               Γ, δ, ξ, Φ                                  Γ, δ, ξ, Φ

When you compare it to the CONVERT with COLLATION method, you can find that this doesn’t have the problem with non existen characters. As the CONVERT converts the characters to particular character set which does not need to contain all the charecters of the source character set. We can see this e.g. on the Norwegian, and Greece characters.

If we use the CONVERT with COLLATION method we receive:

WITH Data AS (
    SELECT
        'Czech' AS [Language]
        ,N'á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů' AS [AccentChars] UNION ALL
    SELECT
        'Norwegian' AS [Language]
        ,N'ø' AS [AccentChars] UNION ALL
    SELECT
        'Estonian' AS [Language]
        ,N'õ' AS [AccentChars] UNION ALL
    SELECT
        'French' AS [Language]
        ,N'à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü' AS [AccentChars] UNION ALL
    SELECT
        'Croatian, Vietnamese' AS [Language]
        ,N'đ' AS [AccentChars] UNION ALL
    SELECT
        'Latvian' AS [Language]
        ,N'ā, ē, ī, ū, ļ, ķ, ņ, ģ' AS [AccentChars] UNION ALL
    SELECT
        'Hungarian' AS [Language]
        ,N'ö, ü, ő, ű' AS [AccentChars] UNION ALL
    SELECT
        'Polish' AS [Language]
        ,N'ą, ć, ę, ń, ó, ś, ź, ż' AS [AccentChars] UNION ALL
    SELECT
        'Romanian' AS [Language]
        ,N'ă, â, î, ș, ț' AS [AccentChars] UNION ALL
    SELECT
        'Slovak' AS [Language]
        ,N'ô, ť, ľ, ŕ, ĺ, ä' AS [AccentChars] UNION ALL
    SELECT
        'Spanish' AS [Language]
        ,N'ñ' AS [AccentChars] UNION ALL
    SELECT
        'Swedish' AS [Language]
        ,N'å' AS [AccentChars] UNION ALL
    SELECT
        'Turkish' AS [Language]
        ,N'ç, ş, ğ' AS [AccentChars] UNION ALL
    SELECT
        'Greece' AS [Language]
        ,N'Γ, δ, ξ, Φ' AS [AccentChars]
)
SELECT
    [Language]
    ,[AccentChars]
    ,CONVERT(varchar(50), [AccentChars]) COLLATE Cyrillic_General_CI_AI AS [RemovedAccent]
FROM Data

Results:

Language             AccentChars                                 RemovedAccent
-------------------- ------------------------------------------- --------------------------------------------
Czech                á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů a, e, i, o, u, y, c, d, e, n, r, s, t, z, u
Norwegian            ø                                           o
Estonian             õ                                           o
French               à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü       a, a, c, e, e, e, e, i, i, o, u, u, u
Croatian, Vietnamese đ                                           d
Latvian              ā, ē, ī, ū, ļ, ķ, ņ, ģ                      a, e, i, u, l, k, n, g
Hungarian            ö, ü, ő, ű                                  o, u, o, u
Polish               ą, ć, ę, ń, ó, ś, ź, ż                      a, c, e, n, o, s, z, z
Romanian             ă, â, î, ș, ț                               a, a, i, ?, ?
Slovak               ô, ť, ľ, ŕ, ĺ, ä                            o, t, l, r, l, a
Spanish              ñ                                           n
Swedish              å                                           a
Turkish              ç, ş, ğ                                     c, s, g
Greece               Γ, δ, ξ, Φ                                  ?, ?, ?, ?

From above we can see, that the CLR solution provides much better results and you do not loose any characters by conversion to non Unicode character set

If you are interested, you can grab sample C# project here: PPSqlClrSafe_AccentRemoval,zip

Any comments are welcomed.

Advertisements

2 thoughts on “Removing Accent (diacritics) using CLR

  1. Thanks for sharing, this was extremely useful for processing a lot of location data with diacritics.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s