A very inexpensive way to profile a string field in Oracle

I've got a bag of Oracle PL/SQL tricks that I carry around with me from project to project and one of them is a simple little function I created to "profile" a string. Profiling strings is a useful technique when your not sure what data is contained in string that might contain alpha, numeric and and when your wondering if their are patterns in the way strings are constructed.

Some examples might be:
  • Phone Numbers stored as strings
    Phone numbers stored in databases are typically very unstructured. Do they have the "country" code? Do they contain parenthesis or dashes?
  • Product ID's
    Many times you'll find Product ID's type fields that are supposed to comply to certain rules (i.e. Two alpha characters followed by 10 numeric characters).
  • U.S. Social Security Numbers
    Please tell me you don't store these in your database, but if must do so in fact you can take a peek at how they are constructed with the below function and query.
  • ...etc...

    Here's the code
    CREATE OR REPLACE FUNCTION fx_string_profile(
    p_str VARCHAR2)
    lv_upper_char VARCHAR2(100)
    lv_lower_char VARCHAR2(100)
    := 'abcdefghijklmnopqrstuvwxyz';
    lv_number_char VARCHAR2(100)
    := '0123456789';
    lv_upper_char_out VARCHAR2(1) := 'X';
    lv_lower_char_out VARCHAR2(1) := 'x';
    lv_number_char_out VARCHAR2(1) := '9';
    lv_char VARCHAR2(1);
    lv_string VARCHAR2(2000);

    IF p_str IS NULL
    END IF;

    FOR i IN 1..length(p_str) LOOP

    lv_char := substr(p_str, i, 1);

    IF instr(lv_upper_char, lv_char) > 0
    THEN lv_string := lv_string||lv_upper_char_out;
    ELSIF instr(lv_lower_char, lv_char) > 0
    THEN lv_string := lv_string||lv_lower_char_out;
    ELSIF instr(lv_number_char, lv_char) > 0
    THEN lv_string := lv_string||lv_number_char_out;
    ELSE lv_string := lv_string||lv_char;
    END IF;


    RETURN lv_string;

    WHEN others THEN RAISE;

    In action
    In this example, I've loaded up about a thousand phone numbers from a real world database where most of the numbers are "US centric". As you can see from the results of the query, the formats (or "string profile") of these phone numbers are really all over the map.
    SQL> CREATE TABLE test_numbers (
    2 phone_number_str VARCHAR2(100));

    Table created.

    SQL> @i.test_numbers.sql
    999 rows created.

    Commit complete.

    SQL> SELECT fx_string_profile(phone_number_str)
    AS phone_num_profile,
    round(100*ratio_to_report(count(*)) over (), 2) pct
    FROM test_numbers
    GROUP BY fx_string_profile(phone_number_str)

    ------------------------------ -------------- --------------
    999-999-9999 551 55.16
    9999999999 118 11.81
    999 999 9999 32 3.2
    999.999.9999 26 2.6
    (999) 999-9999 21 2.1
    999-999-9999 x999 16 1.6
    ...many, many more here...
    [99](9)9999-9999 1 .1
    (99) 9 9999 9999 1 .1
    ...many, many more here...
    999-999-9999 1 .1

    As you can see, I've replaced numbers with the number "9", capital letters with a capital "X" and lower case letters with a lower case "x". All other characters I've left in the string as these other characters have been important for the use cases I've implemented this code for in the past. If you've got a good eye (or you've drank plenty of coffee today) you'll notice that the last result looks very much like the first record. I took another look and noticed that this particular record had trailing spaces and my PL/SQL function doesn't account for that. So, it's not perfect but like I said, this code is right out of my toolbox and I've used it before on close to a dozen or so projects, so although not perfect it's still very helpful.

    A second approach...
    There's always two ways to "skin a cat" and for those of you are feeling ambitious today, this one is for you. In this example I've used the native Regular Expression (RegEx) capabilities in Oracle. Take a peek:
    SQL> SELECT  regexp_replace(
    regexp_replace(phone_number_str,'[[:digit:]]', 9),
    '[[:alpha:]]', 'X') AS phone_num_profile,
    round(100*ratio_to_report(count(*)) over (), 2) pct
    FROM test_numbers
    GROUP BY regexp_replace(
    regexp_replace(phone_number_str,'[[:digit:]]', 9),
    '[[:alpha:]]', 'X')

    ------------------------------ -------------- --------------
    999-999-9999 551 55.16
    9999999999 118 11.81
    ...many many more...
    999-999-9999 X999 16 1.6
    XXXXXX 6 .6
    ...many many more...
    999-999-9999 1 .1
    999 999-9999 X 9 1 .1

    In this second result set, noticed how all the alpha characters (both upper and lower) have come back with an upper case "X". At this point in my RegEx career, I haven't been able to figure out how to correctly use the "match parameter" capabilities that I think might assist in correcting this (or perhaps the :upper: and :lower: classes would help?). This second example is perhaps all you need if you'd like to profile strings without installing any of your own custom PL/SQL objects. I've also seen results of string profiles created with RegEx expressions that look like "A9" - where the string is a nine byte string of alpha characters. I haven't seen a database implementation of this format yet and at this point I haven't had a need to write one. If your interested in RegEx within the Oracle database, I'd recommend you pick up the "Oracle Regular Expressions" pocket reference, it's quite useful.

    Final thoughts
    If you find yourself doing tasks like this very often, there are "Commercial Off The Shelf" (COTS) products that take this to a level far beyond what you could ever do yourself. These tools are called "Data Profiling" tools and vendors like DataFlux and Trillium are excellent choices. Both have very intuitive user interfaces and are extremely powerful tools that can immensely benefit organizations who understand the value of a Data Profiling. Many times the need for a more powerful tool such as these data profiling tools even spark cultural changes in an organization and Data Governance and Quality programs begin - which is what happened at my current shop - but that's another story (blog entry).

    Until next time...Rich
  • Comments

    James Standen said…
    Great post Rich- this kind of data format analysis is extremely useful.

    And its often amazing how many variations users are able to come up with.

    Every time I see a data set that needs this kind of analysis, I think to myself- can we try to put constraints on the field at the point of entry? Of course, thats not easy either since in different parts of the world different formats are valid- but its a battle that is best fought as early as possible.

    That being said- techniques such as those you show here will always be needed. Good stuff.
    Great post Rich! Informatica also has a product that does character and token profiling called Data Explorer that lets you define tokens and profile various data types
    Dylan Jones said…
    Great post Rich, I'm a big fan of developing scripts internally to cut costs, get some momentum and identify some low-hanging fruit to fund more advanced technology.

    At the risk of shamelessly plugging some of my own Oracle and MS-Access pattern analysis scripts, they are all covered here, it's all open sourced code, feel free to adapt and extend: