Thursday, April 21, 2011

How it is done...

First of all I would like to thank the people who were kind enough to help the world by posting these two references on the web:

MSSQL Levenshtein
Creating Stored Procedures and User-Defined Functions with Managed Code

They actually saved my life after we found out that the new Fuzzy Matching engine that our development team had to implement, crashed on us as soon as the tens of thousands daily transactions started bombarding it!

Anyway, if you had a look at the reference blogs you will have a fair idea of the secret I'm about to share...

C# and Dynamic-Link Libraries (DLL)
(Oh no... C# isn't that .net? MS Sql is MS and MS is .net so FACE IT!)

Open a notepad or if you have it an IDE and write the following code:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class StoredFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble Levenshtein(SqlString S1, SqlString S2)
    {
        if(S1.IsNull)
          S1 = new SqlString("");

        if(S2.IsNull)
          S2 = new SqlString("");

        String SC1 = S1.Value.ToUpper();
        String SC2 = S2.Value.ToUpper();
        
        int n = SC1.Length;
        int m = SC2.Length;

        int[,] d = new int[n + 1, m + 1];
        int cost = 0;
        
        if (n + m == 0) {
            return 100;
        } else if (n == 0) {
            return 0;
        } else if (m == 0) {
            return 0;
        }

        for (int i = 0; i <= n; i++)
            d[i, 0] = i;

        for (int j = 0; j <= m; j++)
            d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                if (SC1[i - 1] == SC2[j - 1])
                    cost = 0;
                else
                    cost = 1;

                d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        }
          
        double percentage = System.Math.Round((1.0 - ((double)d[n, m]/(double)System.Math.Max(n,m))) * 100.0,2);
        return percentage;
    }
};

Save this file at any location on your hard drive... Call it for instance 'levenshtein.cs'.

Next step of the process is to compile this little class into a dll file.

Windows comes installed with a C# compiler so this should be no problem, just open a dos command window, point it to the location you saved this file and type:
E:\projects\C#\DBObjects>csc.exe /t:library /out:UserFunctions.dll levenshtein.cs

You will get a response that looks something like this:
Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.5420
for Microsoft (R) .NET Framework version 3.5
Copyright (C) Microsoft Corporation. All rights reserved.


E:\projects\C#\DBObjects>
You will be notified with compilation errors if you made a mistake in your script.

If the dos box comes back with something like:

'csc.exe' is not recognized as an internal or external command, operable program or batch file.

That just means you need to add the path to your c# folder to your windows environment PATH variable. Just search your computer for 'csc.exe' and you should find multiple hits. If a compilation fails, just try one of the others or make sure that the folder containing the csc.exe file also contains the System.Data.dll file. Sooner or later you will find one that works to compile the levenshtein.cs file.

In the same folder as where you created the levenshtein.cs file you will now find a file called 'UserFunctions.dll' (That is if you named it that way in the dos command).

That dll file we can now use to create an assembly in MS Sql.

To do that, open your Microsoft SQL Server Management Server Studio and open the database you want to bind the dll to.

Open the programmability folder, right click on Assemblies and click New Assembly...




















Next click the browse button in the window that opens up and select your newly created dll file.

























Now you will see that your Assemblies contains an extra package called UserFunctions (if that is what you called the dll file)





















In order for SQL to start using the functions contained in your Assembly we will create a wrapper function in native SQL-talk.
I open a new query in my Microsoft SQL Server Manager and type the following and run it...
CREATE Function fn_Levenshtein(@S1 nvarchar(4000), @S2 nvarchar(4000))
    RETURNS float as EXTERNAL NAME UserFunctions.StoredFunctions.Levenshtein
GO

UserFunctions.StoredFunctions.Levenshtein or in other words: AssemblyName.ClassName.FunctionName.

Now you can use the fn_levenshtein function to your leisure. Testing

SELECT dbo.fn_Levenshtein(
 replicate('abcdeghij',299),
 replicate('abcdfghij',299)
)

Should give you a result of 88.89% match.

It is possible that your SQL server is set up to not allow clr functions. This you can fix easily running the query:

sp_configure 'clr enabled', 1
GO
reconfigure
GO

There, now you can have a ball and boost your fuzzy matching queries with Levenshtein logic at a very low cost as you will soon find out!

I hope this may be of help to anyone!

Why should you continue reading...

Another post about the Levenshtein distance in ms sql... Why should you bother reading it?

There's ample examples available out there to create a function in sql, but I can give you one very good reason to keep reading... SPEED!

I will start by showing you the benefits so you know whether or not this is interesting for you. It could save you some valuable time reading stuff that makes no sense.

First a screenshot of the results you get when using a levenshtein function written in native SQL:















Here I compare two strings of 2400 characters each and as you see hilighted in green it took about 20 seconds to do the job, yielding a result of 88.89% match. Ok, my server might not be running optimal, but it is about a comparison and the next screenshot I show will be from the same server.

Next a screenshot of the results you get when using a levenshtein function using the methods described further down: (Keeping the secret until the last moment)















Comparing the same string now takes (indeed, there is no trickery here)...
0, naught, nil, zero seconds, yielding the same result of 88.89% match. My functions return percentage as result since I find that makes more sense than the number of edits a levenshtein function would normally return.
And in fact if you use the Sql Profiler you will find that it takes less than a millisecond, it is so fast that the profiler doesn't even notice a time difference between the start and the end of the function!

Does that at least tickle your fancy?

If it does then check out my next post where I will show you how it is done.
I will show you both ways, so you can check if maybe my native Sql function could be optimised to yield the same results... I think not, but I like to be surprised.

How it is done...