De-Duplication of names using DataFlux

March 17, 2010

De-Duplication of names using DataFlux

In honor of St. Patrick's day here in the US, I've decided to take a stab at de-duplicating a list of "Patrick" names...

Step 1:
Create a list of names

Step 2:

Assuming you have these names in one single database field, concatenate the records above in the various formats you believe you have in your data. If your not sure about the formats your data is currently stored in, I'd suggest you profile your data.

For this example, I've listed 17 names in three formats (total 51 records):
- First Name & Last Name
- Last Name, First Name
- Last Name/First Name

Step 3:
Create sample table (via direct connection to Oracle database using SQL*Plus)

SQL>CREATE TABLE list_of_names(
         id              NUMBER,
         long_name       VARCHAR2(200),
         df_cluster_id   NUMBER);

Table created.

...51 insert statements here...

SQL>select count(*) from list_of_names;

      COUNT(*)
--------------
            51

Step 4: Create a DataFlux Architect Job which reads the table and parses the "full name" stored in the long_name field:

Notes: As you can see the current Architect Job only has three steps (what I like to call "wigets). The first is simple, just read the table. The second is a utility to run javascript like commands on the fields and in this case I've created a new field called newName which no longer has the slashes in the full name. The third is a parse widget which parses this new field called newName into 6 new name fields. When using the Parse "widget" I like to make sure I've clicked the "Preserve null values" checkbox as well as I've pushed all the fields which have been read into the output of this widget (to do this, click on "additional outputs" and choose all the fields to output).
At this point I like to "preview" the data so I can see what's going on. To do so, right click the "Parsing" widget and choose "preview".

Step 5: Add "Match Codes" to the Architect Job
The DataFlux Architect tool has this fantastic widget which builds "match codes" - kind of a "black magic", let's see if I can try to explain this....
DataFlux Match codes are strings which are created by evaluating the input string and depending on the sensitivity level you choose it will create a string of "gobbly-guck" which will allow you to group records together. Clear as mud? I think perhaps a picture is worth a thousand words, so...
Here is how you create match codes on both first and last names:

And the preview:

Step 6: Add "grouping" to the DataFlux Architect job
Grouping in DataFlux is called clustering and the following shows you how I've "grouped" together data based on two criteria:
A) records with the same long name get grouped together
B) records with the same First Name and Last Name Match Codes get grouped together

Step 7: Update the source table

Step 8: Run the job and query the results:
(query from my SQL*Plus session)

SQL>select df_cluster_id, long_name from list_of_names order by 1, 2;

 DF_CLUSTER_ID LONG_NAME
-------------- -------------------------------------------
             0 DR. Patrick O'Connell
             0 MR Patrick Richard O'Connell II
             0 MR Patrick Richard O'Connell II, CPA
             0 O'Connell II, MR Patrick Richard
             0 O'Connell II, Patrick Richard
             0 O'Connell II/MR Patrick Richard
             0 O'Connell II/Patrick Richard
             0 O'Connell JR, Patrik
             0 O'Connell JR/Patrik
             0 O'Connell, DR. Patrick
             0 O'Connell, Patrick
             0 O'Connell, Patrick Richard
             0 O'Connell/DR. Patrick
             0 O'Connell/Patrick
             0 O'Connell/Patrick Richard
             0 OConnell, Patrick
             0 OConnell, Patrick
             0 OConnell/Patrick
             0 OConnell/Patrick
             0 Patrick O'Connell
             0 Patrick OConnell
             0 Patrick OConnell
             0 Patrick Richard O'Connell
             0 Patrick Richard O'Connell II
             0 Patrik O'Connell JR
             1 O'Connell, Patty
             1 O'Connell/Patty
             1 Patty O'Connell
             2 O'Connell, Pat
             2 O'Connell/Pat
             2 Pat O'Connell
             3 O'Connell, Paddy
             3 O'Connell/Paddy
             3 OConnell, Paddy
             3 OConnell/Paddy
             3 Paddy O'Connell
             3 Paddy OConnell
             4 O'Connell, Patricia
             4 O'Connell/Patricia
             4 Patricia O'Connell
             5 Connell, Richard
             5 Connell/Richard
             5 Richard Connell
             6 O'Connell, Richard Patrick
             6 O'Connell/Richard Patrick
             6 Richard Patrick O'Connell
             7 O'Connell, Pádraic
             7 O'Connell/Pádraic
             7 Pádraic O'Connell
             8 O'Connell II, CPA, MR Patrick Richard
             8 O'Connell II, CPA/MR Patrick Richard

51 rows selected.

SQL>select df_cluster_id, count(*) TOTAL
  2  from list_of_names
  3  GROUP BY df_cluster_id
  4  ORDER BY 1;

 DF_CLUSTER_ID     TOTAL
-------------- ---------
             0        25
             1         3
             2         3
             3         6
             4         3
             5         3
             6         3
             7         3
             8         2
               ---------
sum                   51

9 rows selected.

Notice that we've collapsed 25 records to the first cluster or grouping, very nice! The DataFlux Architect tool allows you to do tons of additional things, such as standardizing these names, pushing the standardized name back to each of these records, merging records, etc... This is just a very quick window into the power of this tool and how it might help you de-duplicate names. I strongly recommend that if you do work like this you look into tools like DataFlux. Others vendors (including Trillium, IBM, etc.) have these capabilities as well and an investment in a tool like this can really assist any data management team and speed up most data management projects significantly.

Until next time...Rich

Comments

Unknown said…

Rich

Nice post. But how about using another Irish name (Daragh) and parse the 13+ spelling variations, the dozen or so "transcription error" versions (Daryl, Darren, Dora...), and then figure out the gender of the data subject based on title and firstname, and then de-dupe.

... the upshot being a comment that name and address data is really great fun to work with if you like puzzles and you need to be aware of cultural/linguistic issues in data if you want to avoid false positives or false negatives.

Mar 22, 2010, 8:56:00 AM

Unknown said…

Bizarre that blogger put my name as "Daragh's Reference" on my last comment.

Mar 22, 2010, 8:59:00 AM

Unknown said…

Can i get more information about how can we update the existing cluster by using the real time cluster update

Mar 29, 2010, 9:53:00 AM

Search This Blog

Rich Murnane's Blog

De-Duplication of names using DataFlux

Comments

Popular Posts

Levenshtein Distance Algorithm: Oracle PL/SQL Implementation using a two-dimensional array of numbers

Data Architect - DC Metro Area (Annapolis MD)