A very inexpensive way to profile a string field in Oracle

February 05, 2010

A very inexpensive way to profile a string field in Oracle

I've got a bag of Oracle PL/SQL tricks that I carry around with me from project to project and one of them is a simple little function I created to "profile" a string. Profiling strings is a useful technique when your not sure what data is contained in string that might contain alpha, numeric and and when your wondering if their are patterns in the way strings are constructed.

Some examples might be:

Phone Numbers stored as strings
Phone numbers stored in databases are typically very unstructured. Do they have the "country" code? Do they contain parenthesis or dashes?

Product ID's
Many times you'll find Product ID's type fields that are supposed to comply to certain rules (i.e. Two alpha characters followed by 10 numeric characters).

U.S. Social Security Numbers
Please tell me you don't store these in your database, but if must do so in fact you can take a peek at how they are constructed with the below function and query.

...etc...

Here's the code

--fx_string_profile.sql
CREATE OR REPLACE FUNCTION fx_string_profile(
    p_str               VARCHAR2) 
RETURN VARCHAR2 IS
    lv_upper_char       VARCHAR2(100) 
                := 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    lv_lower_char       VARCHAR2(100) 
                := 'abcdefghijklmnopqrstuvwxyz';
    lv_number_char      VARCHAR2(100) 
                := '0123456789';
    lv_upper_char_out   VARCHAR2(1) := 'X';
    lv_lower_char_out   VARCHAR2(1) := 'x';
    lv_number_char_out  VARCHAR2(1) := '9';
    lv_char             VARCHAR2(1);
    lv_string           VARCHAR2(2000);
BEGIN

  IF      p_str IS NULL
  THEN    RETURN NULL;
  END IF;

  FOR i IN 1..length(p_str) LOOP

    lv_char :=  substr(p_str, i, 1);
          
    IF      instr(lv_upper_char, lv_char) > 0
    THEN    lv_string := lv_string||lv_upper_char_out;
    ELSIF   instr(lv_lower_char, lv_char) > 0
    THEN    lv_string := lv_string||lv_lower_char_out;
    ELSIF   instr(lv_number_char, lv_char) > 0
    THEN    lv_string := lv_string||lv_number_char_out;
    ELSE    lv_string := lv_string||lv_char;
    END IF;

  END LOOP;
      
  RETURN lv_string;

EXCEPTION
WHEN others THEN RAISE;
END;
/
SHOW ERRORS

In action
In this example, I've loaded up about a thousand phone numbers from a real world database where most of the numbers are "US centric". As you can see from the results of the query, the formats (or "string profile") of these phone numbers are really all over the map.

SQL> CREATE TABLE test_numbers (
  2      phone_number_str    VARCHAR2(100));

Table created.

SQL> @i.test_numbers.sql
999 rows created.

SQL> COMMIT;
Commit complete.

SQL> SELECT  fx_string_profile(phone_number_str) 
                AS phone_num_profile,
           COUNT(*) RECORDS,
           round(100*ratio_to_report(count(*)) over (), 2) pct
     FROM    test_numbers
     GROUP BY fx_string_profile(phone_number_str)
     ORDER BY 2 DESC;

PHONE_NUM_PROFILE                     RECORDS            PCT
------------------------------ -------------- --------------
999-999-9999                              551          55.16
9999999999                                118          11.81
999 999 9999                               32            3.2
999.999.9999                               26            2.6
(999) 999-9999                             21            2.1
999-999-9999 x999                          16            1.6
...many, many more here...
[99](9)9999-9999                            1             .1
(99) 9 9999 9999                            1             .1
...many, many more here...
999-999-9999                                1             .1

As you can see, I've replaced numbers with the number "9", capital letters with a capital "X" and lower case letters with a lower case "x". All other characters I've left in the string as these other characters have been important for the use cases I've implemented this code for in the past. If you've got a good eye (or you've drank plenty of coffee today) you'll notice that the last result looks very much like the first record. I took another look and noticed that this particular record had trailing spaces and my PL/SQL function doesn't account for that. So, it's not perfect but like I said, this code is right out of my toolbox and I've used it before on close to a dozen or so projects, so although not perfect it's still very helpful.

A second approach...
There's always two ways to "skin a cat" and for those of you are feeling ambitious today, this one is for you. In this example I've used the native Regular Expression (RegEx) capabilities in Oracle. Take a peek:

SQL> SELECT  regexp_replace(
              regexp_replace(phone_number_str,'[[:digit:]]', 9),
              '[[:alpha:]]', 'X') AS phone_num_profile,
          COUNT(*) RECORDS,
          round(100*ratio_to_report(count(*)) over (), 2) pct
  FROM    test_numbers
  GROUP BY  regexp_replace(
              regexp_replace(phone_number_str,'[[:digit:]]', 9),
              '[[:alpha:]]', 'X')
 ORDER BY 2 DESC;

PHONE_NUM_PROFILE                     RECORDS            PCT
------------------------------ -------------- --------------
999-999-9999                              551          55.16
9999999999                                118          11.81
...many many more...
999-999-9999 X999                          16            1.6
XXXXXX                                      6             .6
...many many more...
999-999-9999                                1             .1
999 999-9999 X 9                            1             .1

In this second result set, noticed how all the alpha characters (both upper and lower) have come back with an upper case "X". At this point in my RegEx career, I haven't been able to figure out how to correctly use the "match parameter" capabilities that I think might assist in correcting this (or perhaps the :upper: and :lower: classes would help?). This second example is perhaps all you need if you'd like to profile strings without installing any of your own custom PL/SQL objects. I've also seen results of string profiles created with RegEx expressions that look like "A9" - where the string is a nine byte string of alpha characters. I haven't seen a database implementation of this format yet and at this point I haven't had a need to write one. If your interested in RegEx within the Oracle database, I'd recommend you pick up the "Oracle Regular Expressions" pocket reference, it's quite useful.

Final thoughts
If you find yourself doing tasks like this very often, there are "Commercial Off The Shelf" (COTS) products that take this to a level far beyond what you could ever do yourself. These tools are called "Data Profiling" tools and vendors like DataFlux and Trillium are excellent choices. Both have very intuitive user interfaces and are extremely powerful tools that can immensely benefit organizations who understand the value of a Data Profiling. Many times the need for a more powerful tool such as these data profiling tools even spark cultural changes in an organization and Data Governance and Quality programs begin - which is what happened at my current shop - but that's another story (blog entry).

Until next time...Rich

Comments

James Standen said…

Great post Rich- this kind of data format analysis is extremely useful.

And its often amazing how many variations users are able to come up with.

Every time I see a data set that needs this kind of analysis, I think to myself- can we try to put constraints on the field at the point of entry? Of course, thats not easy either since in different parts of the world different formats are valid- but its a battle that is best fought as early as possible.

That being said- techniques such as those you show here will always be needed. Good stuff.

Feb 5, 2010, 10:04:00 AM

data quality chronicle said…

Great post Rich! Informatica also has a product that does character and token profiling called Data Explorer that lets you define tokens and profile various data types

Feb 8, 2010, 9:43:00 AM

Dylan Jones said…

Great post Rich, I'm a big fan of developing scripts internally to cut costs, get some momentum and identify some low-hanging fruit to fund more advanced technology.

At the risk of shamelessly plugging some of my own Oracle and MS-Access pattern analysis scripts, they are all covered here, it's all open sourced code, feel free to adapt and extend:

http://www.dataqualitypro.com/data-quality-home/best-of-technical-data-quality-tutorials.html

Feb 11, 2010, 5:12:00 AM

Search This Blog

Rich Murnane's Blog

A very inexpensive way to profile a string field in Oracle

Comments

Popular Posts

Data Architect - DC Metro Area (Annapolis MD)

Levenshtein Distance Algorithm: Oracle PL/SQL Implementation using a two-dimensional array of numbers