Tying your data cleansing routines to your trouble ticket system

If your company is anything like my company, you have to run batch jobs to clean up data either on demand or via a scheduling tool like "crontab". When our team first purchased our favorite data quality tool to help us do these cleanups (DataFlux), we took a few moments and decided that we would need to keep track of any and all data that was "cleaned". We spent a little time at the whiteboard and came up with what we call the "Data Quality (DQ) Logging Framework".

The framework includes:
- A couple tables to be deployed to your database - a log table and a simple lookup table
- A standard trouble ticketing system
- A promise from everyone on the team that they will place a record in the log table each time they do any cleaning

Understanding your trouble ticket system
At my shop we use a fantastic tool to keep track of trouble tickets called JIRA. Each "ticket" in our JIRA system has a unique identifier which looks like ABC-99; where ABC is a short alias for a JIRA project and 99 represents the "ninety ninth" ticket created under this project. Where am I heading with all this? Well… at our shop when a client finds any type of issue in our application (albeit an application bug or a data quality issue) the customer service team member opens a ticket in our JIRA system to ensure the issue gets addressed. Most shops that have even just a "little bit mature" processes use a tool like JIRA. Some other ticket systems I've worked with include PeopleSoft, Bugzilla and Test Director and each one of these tools has a unique identifier for a trouble ticket, just like JIRA does. You need to take a peek at your ticket system so that you know what makes a ticket unique in that system. Most times it's very clear, if it's not you'll have to do some homework. Special note here: many times our team creates our own tickets when we know something's up – just so we can keep track of any and all cleanups within our JIRA system.

Moving on…
Imagine if you will your going to be cleaning up your "Customer" table – which has a field called "Customer ID" as it's primary/unique key. Here's a mockup of the DQ Logging Framework tables you can deploy so that your keep track of changes you and your team make on this "Customer" table.

Description of Fields:
1) Surrogate key for the "CUSTOMER_DQ_CHANGE_LOG"
2) Unique Customer ID: this will be the primary key of your CUSTOMER table which you're on your way to "cleaning up"
3) Log type code: this field is a foreign key out to a small lookup table. In our table we have codes for merging, another for deleting, etc… You'll need to go to your whiteboard to determine which types of codes you might need, but you'll probably need one for "records merged", another for "record deleted" and another for "record updated"
4) Change date: the date (and time) which you "cleaned up" this record
5) Issue Tracking Ticket ID #: this is your ticket number that should identify the problem you're trying to address. Again, this ticket could be generated directly from a customer request, or it could be generated from someone on your team – it doesn't matter who created the ticket, what matters is that you have a trouble ticket to go back to so that you know why the record was changed.
6) Description of change: we use this field for a short description of the issue, for example something like "cleaned up phone number".
7) Log type Ref table: log_type_cd (code) just a numeric value which keeps track at a high level of the type of change (see #3 above)
8) Description: a short description of the type of changes you make to your tables

Now what happens?
It's simple. In each cleanup job we write (in our case, DataFlux architect job), we insert a record into the logging framework to keep track of what we did - every record updated/deleted/merged/etc gets a record in the framework!

Well, we need to keep track of what we're doing for accountability, right? We need to communicate metrics to our senior managers (a.k.a. justify the costs of our groups and tools and processes). We need to be able to brag to folks at the water cooler so that everyone knows how much data we're "beating up".

What's next?
Use your imagination. At our shop we clean up transactional data and keep track of changes using tables just like the above. Each week we roll up the changes by customer and then illustrate in a dashboard tool which customer's data has been cleaned up and how "bad" it would be if we didn't clean up anything. There's a ton of things you can do with this data, imagine if you will that your running a business & your business is about cleaning up data & business is good. What metrics can you keep an eye on to make sure your business is running efficiently? Do tickets diminish for customer's who's data you've cleaned up? Do you have certain customer's that you keep cleaning up over and over? What other correlations can you cull from this, I'm pretty sure the sky is the limit.

Until next time…Rich