Removing Accent (diacritics) using CLR

Some times it may happen, that you need to remove accent (diacritics) from string when you are querying data or when you are storing data into particular table.

There are several methods how to achieve this. One of the possible solution is using a CONVERT with COLLATE as Twitted by MVP Mladen Prajdic and mentioned on Luke Jian blog, but this method has an issue I will mention below.

Other solution could be replacement of accented characters with the ones without accent. But this is a very problematic solution as you have to do a lot of replacement (the performance can suffer) and it will be problematic to cover all the possible accent characters and you easily miss one. Similar to this could be creation of a replacement table and implementation of function similar to Splitting function using Tally Table (Jeff Moden). But again you will have to write all possible combination of accented and not accented characters into the replacement table.

So if you do not want to change the collation and would like to avoid possible issues of the CONVERT method and avoid writing replacement tables etc., you can easily use a very simple amd known CLR method. This method is using string normalization and CharUnicodeInfo class from the System.Globalization name space. Based on this you can write a very simple scalar CLR function.

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.Globalization;

public class StringsCommon
{
    /// <summary>
    /// Removes Accent (Diacritics) from string
    /// </summary>
    /// <param name="sourceString">Source string fro wchich accent should be removed</param>
    /// <returns>string without accent</returns>
    [SqlFunction(IsDeterministic = true)]
    public static SqlChars RemoveAccent(SqlString sourceString)
    {
        if (sourceString.IsNull)
            return SqlChars.Null;

        string normalized = sourceString.Value.Normalize(NormalizationForm.FormD);

        StringBuilder output = new StringBuilder(sourceString.Value.Length);

        foreach (char ch in normalized)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                output.Append(ch);
        }
        return new SqlChars(output.ToString());
    }
}

Once you compile above mentioned function into an .net assembly, you can register the assembly and function in your database.

CREATE ASSEMBLY [PPSqlClrSafe]
AUTHORIZATION [dbo]
FROM 'C:CLRPPSqlClrSafe.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE FUNCTION [ClrSafe].[fn_RemoveAccent](
	@sourceString [nvarchar](max)  --Source string to remove accent
)
RETURNS [nvarchar](4000)
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [PPSqlClrSafe].[StringsCommon].[RemoveAccent]
GO</pre>
</div>
Once the function is registered, you can easily use it to remove accent (diacritics) from whatever string you want
<div style="max-height:400px;overflow:auto;">
<pre class="brush:sql">WITH Data AS (
    SELECT
        'Czech' AS [Language]
        ,N'á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů' AS [AccentChars] UNION ALL
    SELECT
        'Norwegian' AS [Language]
        ,N'ø' AS [AccentChars] UNION ALL
    SELECT
        'Estonian' AS [Language]
        ,N'õ' AS [AccentChars] UNION ALL
    SELECT
        'French' AS [Language]
        ,N'à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü' AS [AccentChars] UNION ALL
    SELECT
        'Croatian, Vietnamese' AS [Language]
        ,N'đ' AS [AccentChars] UNION ALL
    SELECT
        'Latvian' AS [Language]
        ,N'ā, ē, ī, ū, ļ, ķ, ņ, ģ' AS [AccentChars] UNION ALL
    SELECT
        'Hungarian' AS [Language]
        ,N'ö, ü, ő, ű' AS [AccentChars] UNION ALL
    SELECT
        'Polish' AS [Language]
        ,N'ą, ć, ę, ń, ó, ś, ź, ż' AS [AccentChars] UNION ALL
    SELECT
        'Romanian' AS [Language]
        ,N'ă, â, î, ș, ț' AS [AccentChars] UNION ALL
    SELECT
        'Slovak' AS [Language]
        ,N'ô, ť, ľ, ŕ, ĺ, ä' AS [AccentChars] UNION ALL
    SELECT
        'Spanish' AS [Language]
        ,N'ñ' AS [AccentChars] UNION ALL
    SELECT
        'Swedish' AS [Language]
        ,N'å' AS [AccentChars] UNION ALL
    SELECT
        'Turkish' AS [Language]
        ,N'ç, ş, ğ' AS [AccentChars] UNION ALL
    SELECT
        'Greece' AS [Language]
        ,N'Γ, δ, ξ, Φ' AS [AccentChars]
)
SELECT
    [Language]
    ,[AccentChars]
    ,[ClrSafe].fn_RemoveAccent([AccentChars]) AS [RemovedAccent]
FROM Data

Which produces below result:

Language             AccentChars                                 RemovedAccent
-------------------- ------------------------------------------- --------------------------------------------
Czech                á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů a, e, i, o, u, y, c, d, e, n, r, s, t, z, u
Norwegian            ø                                           ø
Estonian             õ                                           o
French               à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü       a, a, c, e, e, e, e, i, i, o, u, u, u
Croatian, Vietnamese đ                                           đ
Latvian              ā, ē, ī, ū, ļ, ķ, ņ, ģ                      a, e, i, u, l, k, n, g
Hungarian            ö, ü, ő, ű                                  o, u, o, u
Polish               ą, ć, ę, ń, ó, ś, ź, ż                      a, c, e, n, o, s, z, z
Romanian             ă, â, î, ș, ț                               a, a, i, s, t
Slovak               ô, ť, ľ, ŕ, ĺ, ä                            o, t, l, r, l, a
Spanish              ñ                                           n
Swedish              å                                           a
Turkish              ç, ş, ğ                                     c, s, g
Greece               Γ, δ, ξ, Φ                                  Γ, δ, ξ, Φ

When you compare it to the CONVERT with COLLATION method, you can find that this doesn’t have the problem with non existen characters. As the CONVERT converts the characters to particular character set which does not need to contain all the charecters of the source character set. We can see this e.g. on the Norwegian, and Greece characters.

If we use the CONVERT with COLLATION method we receive:

WITH Data AS (
    SELECT
        'Czech' AS [Language]
        ,N'á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů' AS [AccentChars] UNION ALL
    SELECT
        'Norwegian' AS [Language]
        ,N'ø' AS [AccentChars] UNION ALL
    SELECT
        'Estonian' AS [Language]
        ,N'õ' AS [AccentChars] UNION ALL
    SELECT
        'French' AS [Language]
        ,N'à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü' AS [AccentChars] UNION ALL
    SELECT
        'Croatian, Vietnamese' AS [Language]
        ,N'đ' AS [AccentChars] UNION ALL
    SELECT
        'Latvian' AS [Language]
        ,N'ā, ē, ī, ū, ļ, ķ, ņ, ģ' AS [AccentChars] UNION ALL
    SELECT
        'Hungarian' AS [Language]
        ,N'ö, ü, ő, ű' AS [AccentChars] UNION ALL
    SELECT
        'Polish' AS [Language]
        ,N'ą, ć, ę, ń, ó, ś, ź, ż' AS [AccentChars] UNION ALL
    SELECT
        'Romanian' AS [Language]
        ,N'ă, â, î, ș, ț' AS [AccentChars] UNION ALL
    SELECT
        'Slovak' AS [Language]
        ,N'ô, ť, ľ, ŕ, ĺ, ä' AS [AccentChars] UNION ALL
    SELECT
        'Spanish' AS [Language]
        ,N'ñ' AS [AccentChars] UNION ALL
    SELECT
        'Swedish' AS [Language]
        ,N'å' AS [AccentChars] UNION ALL
    SELECT
        'Turkish' AS [Language]
        ,N'ç, ş, ğ' AS [AccentChars] UNION ALL
    SELECT
        'Greece' AS [Language]
        ,N'Γ, δ, ξ, Φ' AS [AccentChars]
)
SELECT
    [Language]
    ,[AccentChars]
    ,CONVERT(varchar(50), [AccentChars]) COLLATE Cyrillic_General_CI_AI AS [RemovedAccent]
FROM Data

Results:

Language             AccentChars                                 RemovedAccent
-------------------- ------------------------------------------- --------------------------------------------
Czech                á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů a, e, i, o, u, y, c, d, e, n, r, s, t, z, u
Norwegian            ø                                           o
Estonian             õ                                           o
French               à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü       a, a, c, e, e, e, e, i, i, o, u, u, u
Croatian, Vietnamese đ                                           d
Latvian              ā, ē, ī, ū, ļ, ķ, ņ, ģ                      a, e, i, u, l, k, n, g
Hungarian            ö, ü, ő, ű                                  o, u, o, u
Polish               ą, ć, ę, ń, ó, ś, ź, ż                      a, c, e, n, o, s, z, z
Romanian             ă, â, î, ș, ț                               a, a, i, ?, ?
Slovak               ô, ť, ľ, ŕ, ĺ, ä                            o, t, l, r, l, a
Spanish              ñ                                           n
Swedish              å                                           a
Turkish              ç, ş, ğ                                     c, s, g
Greece               Γ, δ, ξ, Φ                                  ?, ?, ?, ?

From above we can see, that the CLR solution provides much better results and you do not loose any characters by conversion to non Unicode character set

If you are interested, you can grab sample C# project here: PPSqlClrSafe_AccentRemoval,zip

Any comments are welcomed.

Advertisements

SSIS 2008 Columns To XML Data Flow Transformation

This article represents a custom components for transformation of columns inside SSIS Data Flow into XML.

If you are writing ETL processes using SSIS, you many times comes to situation, when you need log errors or other problems which can occur during ETL process. An example can be a failure of data conversion during the ETL process. In that situation you mostly would like to know that record exactly caused that error and log such conversion errors into particular error table.

Problem occurs when you have a lot of tables and want to see what data were converted and caused that particular error. As SSIS requires that the sources and destinations needs to be strongly mapped and you have different sources with different structures, this leads to a situation where you have a lot of different error tables – generally one per the source structure at which an error can occur.

This situation can be easily solved by a custom Data Flow component which will convert all source columns into an XML. In that situation you can only have one Error Table, which will contain e.g. Error Code, Error Column, Description and an XML Field in which will be all the data of particular row which caused error encoded in XML Form. This rapidly simplifies error logging and also error log monitoring as you need to monitor only a single table.

So the SSIS Data Flow with the Custom Columns to XML Transformation could look like the one below.

Columns To Xml Data FLow

ColumnsToXML PropertiesColumnsToXML Input Columns

You can provide a SourceID and SourceName properties which are then part of the produced XML so you can easily identify the source. In the Input Columns you select columns you want to be written to the XML. You can also specify an OutpuAlias and the the column names written into the XML will be the one specified in the OutputAlias property.

The Log written using that component and the XML produced will look like the ones below.

ColumnsToXML Log

<row sourceID="1" sourceName="Test Data Conversion">
  <Column name="Name" id="235" lineageId="72">Joseph</Column>
  <Column name="ValueInt" id="236" lineageId="75">2</Column>
  <Column name="ValueDate" id="237" lineageId="78">2012/01/35</Column>
</row>

Using the .value method of XML data type we can easily querying necessary information from the log table for whatever source in the table. So for our sample scenario we would like to query the Name, ValueInt and ValueDate columns stored as XML. This can be easily achieved by below SQL query:

SELECT
    [RowID]
    ,[ErrorCode]
    ,[ErrorColumn]
    ,[ColumnsXMLData].value('data((/row/Column[@name="Name"])[1])', 'varchar(128)') AS Name
    ,[ColumnsXMLData].value('data((/row/Column[@name="ValueInt"])[1])', 'varchar(10)') AS ValueInt
    ,[ColumnsXMLData].value('data((/row/Column[@name="ValueDate"])[1])', 'varchar(10)') AS ValueDate
FROM [TestDB].[dbo].[ETLErrorLog]
WHERE [ColumnsXMLData].value('data((/row/@sourceID)[1])', 'int') = 1

As you can see from the samples above, you can create easily create single Error table for whatever ETL logging and simplify the logging rapidly as you#160; do not need to create the error tables with separate structures for each source.

ColumnsToXML Implementation

The ColumnsToXmlTransform is a custom implementation of the PipelineComponent. It is developed in C# for .NET framework 3.5 and uses Linq to XML for creating the XML output.

The component processes selected input columns, generates the XML and stores it in the output column called ColumnsXMLData.

Except the ColumsXMLDataSourceID and SourceName) to allow you to identify the source row source once it is written as XML into the database and simplifies querying for data coming from particular sources as showed above.

Once the component is built, it is necessary to install it into the Global Assembly Cache and to the PipelineComponents folder in the SQL Server Integration services installation path on the machine on which the SSIS using this component will be running.

To install it into a GAC you can use the gacutil.exe which is part of the Windows SDK.

Component is now available publicly with other components on GitLab as projects for SSIS 2008 – SSIS 2016.

If you run the development environment as with elevated Administrator privileges, the project contains Post Build events, which will install the assembly into the GAC (it assumes you have Windows SDK 7.0 installed – if you have newer version, modify the path to the gacutil.exe accordingly) and also copies the assembly into the codePipelineComponents/code folder of the default SQL Server Integration services installation. If you installation directory differ, please modify the paths in Post Build events accordingly.

Once you have compiled the code, installed in GAC and copied to the PipelineComponents folder, you can include the component to the data flow items palette. Right-click on the palette and select Choose Items

Choose Data Flow Transformation Items Pop-upChoose SSIS Data Flow Items

Once you select the component, it will appear on the toolbox and you can start using it.

Columns To XML On Palette

Once I will have more time, I will try to create an installer for the component. Anyway, any comments are welcome.