Levenshtein Distance Algorithm: Oracle PL/SQL Implementation using a two-dimensional array of numbers
In a previous post, I mentioned a concept called Levenstein Distance (or Edit Distance). I also mentioned that Steve Feuerstein posted some code to the Oracle Magazine site that shows us how to work with a two-dimensional array of numbers.
While thinking about Steve's article, I decided to create a new PL/SQL Function to determine the Levenstein Distance between two objects based on Barbara Boehmer's Levenshtein Distance Algorithm: Oracle PL/SQL Implementation and Steve's article.
Here's the code for all of you to enjoy:
While thinking about Steve's article, I decided to create a new PL/SQL Function to determine the Levenstein Distance between two objects based on Barbara Boehmer's Levenshtein Distance Algorithm: Oracle PL/SQL Implementation and Steve's article.
Here's the code for all of you to enjoy:
--fx_ld.sql
--Levenshtein distance
CREATE OR REPLACE FUNCTION fx_ld(
p_string_orig VARCHAR2,--s
p_string_new VARCHAR2)--t
RETURN NUMBER DETERMINISTIC IS
TYPE data_t IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
TYPE array_t IS TABLE OF data_t INDEX BY PLS_INTEGER;
l_2d_grid array_t;
TYPE char_table_type
IS TABLE OF char(1) INDEX BY binary_integer;
s char_table_type;
t char_table_type;
n PLS_INTEGER := LENGTH(p_string_orig);
m PLS_INTEGER := LENGTH(p_string_new);
lv_cost PLS_INTEGER;
BEGIN
IF n = 0
THEN RETURN m;
END IF;
IF m = 0
THEN RETURN n;
END IF;
--Let's load up the arrays
l_2d_grid (0) (0) := 0;
FOR i IN 1..n
LOOP
s(i) := SUBSTR(p_string_orig, i, 1);
l_2d_grid (i) (0) := i;
FOR j IN 1..m
LOOP
t(j) := SUBSTR(p_string_new, j, 1);
l_2d_grid (0) (j) := j;
IF s(i) = t(j)
THEN lv_cost := 0;
ELSE lv_cost := 1;
END IF;
l_2d_grid (i) (j) := LEAST (
l_2d_grid (i-1)(j) + 1,
l_2d_grid (i)(j-1) + 1,
l_2d_grid (i-1)(j-1) + lv_cost);
--dbms_output.put_line(i||':'||j||' := '||
--l_2d_grid (i) (j)||', cost:='||lv_cost||
--', s(i):='||s(i)||', t(j):='||t(j));
END LOOP;
END LOOP;
--The Levenshtein(Edit) distance is found in cell d[n,m]
RETURN l_2d_grid (n) (m);
END fx_ld;
SQL>select fx_ld('XRICH', 'RICHARD') LD from dual;
LD
-------
4
Comments
Its good to see that other Levenshtein Distance algorithm implementations for Oracle are floating about. I tried your one and felt (although I have no hard proof) that it was faster than Barbaras one. But I ran into a big problem, it doesn't work for strings of numbers eg. fx_ld('476','567'), and instead throws an error about line 28?
Thanks a bundle
Josh
I've tried this on a 10gR1 and 10gR2 database and both work fine.
SQL>select fx_ld('476','567') as ld from dual;
LD
--------------
3
I've uploaded a .sql file to my website with the function for you to download.
http://www.sqlquery.com/fx_ld.sql
If that version doesn't work we can add some debugging in there for you, just let me know...Rich
http://www.psoug.org/reference/utl_match.html
Hyperlink:
http://www.psoug.org/reference/utl_match.html
Rich
Ran the example on that page and sure enough, it works:
RICH.SALT> create table countries(name varchar2(15) not null);
RICH.SALT> insert into countries values ('Poland');
RICH.SALT> insert into countries values ('Germany');
RICH.SALT> insert into countries values ('United States');
RICH.SALT> insert into countries values ('Portugal');
RICH.SALT> insert into countries values ('Czech Republic');
RICH.SALT> insert into countries values ('China');
RICH.SALT> insert into countries values ('Slovakia');
RICH.SALT> insert into countries values ('Slovenia');
RICH.SALT> commit;
RICH.SALT> select name
2 ,to_char(utl_match.edit_distance(name, 'Slovnia'),'999') edit_dist
3 ,to_char(utl_match.edit_distance_similarity(name, 'Slovnia'),'999') edit_dist_sim
4 ,to_char(utl_match.jaro_winkler(name, 'Slovnia'),'999d9999') jaro_winkler
5 ,to_char(utl_match.jaro_winkler_similarity(name, 'Slovnia'),'999') jaro_winkler_sim
6 from countries
7 order by jaro_winkler_sim desc;
NAME EDIT EDIT JARO_WINK JARO
--------------- ---- ---- --------- ----
Slovenia 1 88 .9750 97
Slovakia 2 75 .8881 88
China 5 29 .5619 56
United States 12 8 .5531 55
Poland 6 15 .5317 53
Portugal 7 13 .5119 51
Germany 7 0 .3571 35
Czech Republic 13 8 .0000 0
8 rows selected.
n PLS_INTEGER := NVL(LENGTH(p_string_orig),0);
m PLS_INTEGER := NVL(LENGTH(p_string_new),0);