Monday, February 20, 2006

Levenshtein Distance Algorithm: Oracle PL/SQL Implementation using a two-dimensional array of numbers

In a previous post, I mentioned a concept called Levenstein Distance (or Edit Distance). I also mentioned that Steve Feuerstein posted some code to the Oracle Magazine site that shows us how to work with a two-dimensional array of numbers.

While thinking about Steve's article, I decided to create a new PL/SQL Function to determine the Levenstein Distance between two objects based on Barbara Boehmer's Levenshtein Distance Algorithm: Oracle PL/SQL Implementation and Steve's article.

Here's the code for all of you to enjoy:


--fx_ld.sql
--Levenshtein distance
CREATE OR REPLACE FUNCTION fx_ld(
p_string_orig VARCHAR2,--s
p_string_new VARCHAR2)--t
RETURN NUMBER DETERMINISTIC IS
TYPE data_t IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
TYPE array_t IS TABLE OF data_t INDEX BY PLS_INTEGER;
l_2d_grid array_t;
TYPE char_table_type
IS TABLE OF char(1) INDEX BY binary_integer;
s char_table_type;
t char_table_type;
n PLS_INTEGER := LENGTH(p_string_orig);
m PLS_INTEGER := LENGTH(p_string_new);
lv_cost PLS_INTEGER;

BEGIN

IF n = 0
THEN RETURN m;
END IF;

IF m = 0
THEN RETURN n;
END IF;

--Let's load up the arrays
l_2d_grid (0) (0) := 0;
FOR i IN 1..n
LOOP
s(i) := SUBSTR(p_string_orig, i, 1);
l_2d_grid (i) (0) := i;

FOR j IN 1..m
LOOP
t(j) := SUBSTR(p_string_new, j, 1);
l_2d_grid (0) (j) := j;
IF s(i) = t(j)
THEN lv_cost := 0;
ELSE lv_cost := 1;
END IF;

l_2d_grid (i) (j) := LEAST (
l_2d_grid (i-1)(j) + 1,
l_2d_grid (i)(j-1) + 1,
l_2d_grid (i-1)(j-1) + lv_cost);
--dbms_output.put_line(i||':'||j||' := '||
--l_2d_grid (i) (j)||', cost:='||lv_cost||
--', s(i):='||s(i)||', t(j):='||t(j));
END LOOP;

END LOOP;

--The Levenshtein(Edit) distance is found in cell d[n,m]
RETURN l_2d_grid (n) (m);

END fx_ld;

SQL>select fx_ld('XRICH', 'RICHARD') LD from dual;

LD
-------
4

8 comments:

Josh K said...

Hi Rich

Its good to see that other Levenshtein Distance algorithm implementations for Oracle are floating about. I tried your one and felt (although I have no hard proof) that it was faster than Barbaras one. But I ran into a big problem, it doesn't work for strings of numbers eg. fx_ld('476','567'), and instead throws an error about line 28?

Thanks a bundle

Josh

Rich said...

hmm... maybe it's a version or NLS language thing?

I've tried this on a 10gR1 and 10gR2 database and both work fine.

SQL>select fx_ld('476','567') as ld from dual;

LD
--------------
3

I've uploaded a .sql file to my website with the function for you to download.

http://www.sqlquery.com/fx_ld.sql

If that version doesn't work we can add some debugging in there for you, just let me know...Rich

jkff said...

Oracle has the UTL_MATCH package that has these functions for free.
http://www.psoug.org/reference/utl_match.html

Rich said...

Thanks jkff, looks like that package is supplied with Oracle 11g and it looks helpful.

Hyperlink:
http://www.psoug.org/reference/utl_match.html

Rich

Rich said...

Found this thread, so I dug around and sure enough my 10gR2 database on my windoz box has the package.

Ran the example on that page and sure enough, it works:

RICH.SALT> create table countries(name varchar2(15) not null);
RICH.SALT> insert into countries values ('Poland');
RICH.SALT> insert into countries values ('Germany');
RICH.SALT> insert into countries values ('United States');
RICH.SALT> insert into countries values ('Portugal');
RICH.SALT> insert into countries values ('Czech Republic');
RICH.SALT> insert into countries values ('China');
RICH.SALT> insert into countries values ('Slovakia');
RICH.SALT> insert into countries values ('Slovenia');
RICH.SALT> commit;

RICH.SALT> select name
2 ,to_char(utl_match.edit_distance(name, 'Slovnia'),'999') edit_dist
3 ,to_char(utl_match.edit_distance_similarity(name, 'Slovnia'),'999') edit_dist_sim
4 ,to_char(utl_match.jaro_winkler(name, 'Slovnia'),'999d9999') jaro_winkler
5 ,to_char(utl_match.jaro_winkler_similarity(name, 'Slovnia'),'999') jaro_winkler_sim
6 from countries
7 order by jaro_winkler_sim desc;

NAME EDIT EDIT JARO_WINK JARO
--------------- ---- ---- --------- ----
Slovenia 1 88 .9750 97
Slovakia 2 75 .8881 88
China 5 29 .5619 56
United States 12 8 .5531 55
Poland 6 15 .5317 53
Portugal 7 13 .5119 51
Germany 7 0 .3571 35
Czech Republic 13 8 .0000 0

8 rows selected.

Bulat said...

You also need to change n and m declarations if values are nullable:
n PLS_INTEGER := NVL(LENGTH(p_string_orig),0);
m PLS_INTEGER := NVL(LENGTH(p_string_new),0);

pbsl said...
This comment has been removed by a blog administrator.
Aymen Medimagh said...

Thank you, I am trying it now