Some times it may happen, that you need to remove accent (diacritics) from string when you are querying data or when you are storing data into particular table.
There are several methods how to achieve this. One of the possible solution is using a CONVERT with COLLATE as Twitted by MVP Mladen Prajdic and mentioned on Luke Jian blog, but this method has an issue I will mention below.
Other solution could be replacement of accented characters with the ones without accent. But this is a very problematic solution as you have to do a lot of replacement (the performance can suffer) and it will be problematic to cover all the possible accent characters and you easily miss one. Similar to this could be creation of a replacement table and implementation of function similar to Splitting function using Tally Table (Jeff Moden). But again you will have to write all possible combination of accented and not accented characters into the replacement table.
So if you do not want to change the collation and would like to avoid possible issues of the CONVERT method and avoid writing replacement tables etc., you can easily use a very simple amd known CLR method. This method is using string normalization and CharUnicodeInfo
class from the System.Globalization
name space. Based on this you can write a very simple scalar CLR function.
using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.Globalization;
public class StringsCommon
{
/// <summary>
/// Removes Accent (Diacritics) from string
/// </summary>
/// <param name="sourceString">Source string fro wchich accent should be removed</param>
/// <returns>string without accent</returns>
[SqlFunction(IsDeterministic = true)]
public static SqlChars RemoveAccent(SqlString sourceString)
{
if (sourceString.IsNull)
return SqlChars.Null;
string normalized = sourceString.Value.Normalize(NormalizationForm.FormD);
StringBuilder output = new StringBuilder(sourceString.Value.Length);
foreach (char ch in normalized)
{
if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
output.Append(ch);
}
return new SqlChars(output.ToString());
}
}
Once you compile above mentioned function into an .net assembly, you can register the assembly and function in your database.
CREATE ASSEMBLY [PPSqlClrSafe]
AUTHORIZATION [dbo]
FROM 'C:CLRPPSqlClrSafe.dll'
WITH PERMISSION_SET = SAFE
GO
CREATE FUNCTION [ClrSafe].[fn_RemoveAccent](
@sourceString [nvarchar](max) --Source string to remove accent
)
RETURNS [nvarchar](4000)
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [PPSqlClrSafe].[StringsCommon].[RemoveAccent]
GO</pre>
</div>
Once the function is registered, you can easily use it to remove accent (diacritics) from whatever string you want
<div style="max-height:400px;overflow:auto;">
<pre class="brush:sql">WITH Data AS (
SELECT
'Czech' AS [Language]
,N'á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů' AS [AccentChars] UNION ALL
SELECT
'Norwegian' AS [Language]
,N'ø' AS [AccentChars] UNION ALL
SELECT
'Estonian' AS [Language]
,N'õ' AS [AccentChars] UNION ALL
SELECT
'French' AS [Language]
,N'à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü' AS [AccentChars] UNION ALL
SELECT
'Croatian, Vietnamese' AS [Language]
,N'đ' AS [AccentChars] UNION ALL
SELECT
'Latvian' AS [Language]
,N'ā, ē, ī, ū, ļ, ķ, ņ, ģ' AS [AccentChars] UNION ALL
SELECT
'Hungarian' AS [Language]
,N'ö, ü, ő, ű' AS [AccentChars] UNION ALL
SELECT
'Polish' AS [Language]
,N'ą, ć, ę, ń, ó, ś, ź, ż' AS [AccentChars] UNION ALL
SELECT
'Romanian' AS [Language]
,N'ă, â, î, ș, ț' AS [AccentChars] UNION ALL
SELECT
'Slovak' AS [Language]
,N'ô, ť, ľ, ŕ, ĺ, ä' AS [AccentChars] UNION ALL
SELECT
'Spanish' AS [Language]
,N'ñ' AS [AccentChars] UNION ALL
SELECT
'Swedish' AS [Language]
,N'å' AS [AccentChars] UNION ALL
SELECT
'Turkish' AS [Language]
,N'ç, ş, ğ' AS [AccentChars] UNION ALL
SELECT
'Greece' AS [Language]
,N'Γ, δ, ξ, Φ' AS [AccentChars]
)
SELECT
[Language]
,[AccentChars]
,[ClrSafe].fn_RemoveAccent([AccentChars]) AS [RemovedAccent]
FROM Data
Which produces below result:
Language AccentChars RemovedAccent
-------------------- ------------------------------------------- --------------------------------------------
Czech á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů a, e, i, o, u, y, c, d, e, n, r, s, t, z, u
Norwegian ø ø
Estonian õ o
French à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü a, a, c, e, e, e, e, i, i, o, u, u, u
Croatian, Vietnamese đ đ
Latvian ā, ē, ī, ū, ļ, ķ, ņ, ģ a, e, i, u, l, k, n, g
Hungarian ö, ü, ő, ű o, u, o, u
Polish ą, ć, ę, ń, ó, ś, ź, ż a, c, e, n, o, s, z, z
Romanian ă, â, î, ș, ț a, a, i, s, t
Slovak ô, ť, ľ, ŕ, ĺ, ä o, t, l, r, l, a
Spanish ñ n
Swedish å a
Turkish ç, ş, ğ c, s, g
Greece Γ, δ, ξ, Φ Γ, δ, ξ, Φ
When you compare it to the CONVERT with COLLATION method, you can find that this doesn’t have the problem with non existen characters. As the CONVERT converts the characters to particular character set which does not need to contain all the charecters of the source character set. We can see this e.g. on the Norwegian, and Greece characters.
If we use the CONVERT with COLLATION method we receive:
WITH Data AS (
SELECT
'Czech' AS [Language]
,N'á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů' AS [AccentChars] UNION ALL
SELECT
'Norwegian' AS [Language]
,N'ø' AS [AccentChars] UNION ALL
SELECT
'Estonian' AS [Language]
,N'õ' AS [AccentChars] UNION ALL
SELECT
'French' AS [Language]
,N'à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü' AS [AccentChars] UNION ALL
SELECT
'Croatian, Vietnamese' AS [Language]
,N'đ' AS [AccentChars] UNION ALL
SELECT
'Latvian' AS [Language]
,N'ā, ē, ī, ū, ļ, ķ, ņ, ģ' AS [AccentChars] UNION ALL
SELECT
'Hungarian' AS [Language]
,N'ö, ü, ő, ű' AS [AccentChars] UNION ALL
SELECT
'Polish' AS [Language]
,N'ą, ć, ę, ń, ó, ś, ź, ż' AS [AccentChars] UNION ALL
SELECT
'Romanian' AS [Language]
,N'ă, â, î, ș, ț' AS [AccentChars] UNION ALL
SELECT
'Slovak' AS [Language]
,N'ô, ť, ľ, ŕ, ĺ, ä' AS [AccentChars] UNION ALL
SELECT
'Spanish' AS [Language]
,N'ñ' AS [AccentChars] UNION ALL
SELECT
'Swedish' AS [Language]
,N'å' AS [AccentChars] UNION ALL
SELECT
'Turkish' AS [Language]
,N'ç, ş, ğ' AS [AccentChars] UNION ALL
SELECT
'Greece' AS [Language]
,N'Γ, δ, ξ, Φ' AS [AccentChars]
)
SELECT
[Language]
,[AccentChars]
,CONVERT(varchar(50), [AccentChars]) COLLATE Cyrillic_General_CI_AI AS [RemovedAccent]
FROM Data
Results:
Language AccentChars RemovedAccent
-------------------- ------------------------------------------- --------------------------------------------
Czech á, é, í, ó, ú, ý, č, ď, ě, ň, ř, š, ť, ž, ů a, e, i, o, u, y, c, d, e, n, r, s, t, z, u
Norwegian ø o
Estonian õ o
French à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü a, a, c, e, e, e, e, i, i, o, u, u, u
Croatian, Vietnamese đ d
Latvian ā, ē, ī, ū, ļ, ķ, ņ, ģ a, e, i, u, l, k, n, g
Hungarian ö, ü, ő, ű o, u, o, u
Polish ą, ć, ę, ń, ó, ś, ź, ż a, c, e, n, o, s, z, z
Romanian ă, â, î, ș, ț a, a, i, ?, ?
Slovak ô, ť, ľ, ŕ, ĺ, ä o, t, l, r, l, a
Spanish ñ n
Swedish å a
Turkish ç, ş, ğ c, s, g
Greece Γ, δ, ξ, Φ ?, ?, ?, ?
From above we can see, that the CLR solution provides much better results and you do not loose any characters by conversion to non Unicode character set
If you are interested, you can grab sample C# project here: PPSqlClrSafe_AccentRemoval,zip
Any comments are welcomed.
Like this:
Like Loading...