What data do we have?

January 05, 2010

What data do we have?

Such a simple question, and one any senior executive or manager might ask. In most organizations, an executive asking this question of an average data geek would receive an answer so complex that their eyes would glaze over in about twenty seconds.

I've been pondering this type of question for a very long time and yet I still don't know a very good way to answer it.

In the past I've built data dictionaries in MS Excel and socialized them with folks who might have asked me the question. The response is usually a "thanks" but deep down I know they are like "huh?". These .xls files typically contain a list of tables and fields with associated metadata such as short descriptions of the tables/fields and the data types and sizes of the fields. Much too granular.

I've built an "Information Catalog" in PHP which works for me at the very granular (table/field) level when I'm doing database development work. Most folks other than someone developing on that database probably wouldn't "get" this tool though. The tool took the .xls version of the data dictionary to a whole new level as it allowed for easy loading, quick entry of descriptions and Flickr-like tagging capabilities at the table level. Again, very useful for me but "gobbledygook" for an executive as it's really just a step up from the .xls solution.

I've never worked with a formal metadata repository but I've seen enough of the dog-and-pony shows by those vendors to know that they aren't really going to answer the executive's simple question.

I wish I could find some examples of cases where folks have illustrated or inventoried their "data assets" in a fashion most senior executives would understand. I'm talking "way" beyond (or is it way "above"?) a simple list of databases and their tables and their fields. Something that a CEO can go and show their other CEO friends and say, "Hey fellow golf buddies, look what I got from my data geeks, my data geeks are better than your data geeks!".

We really need to figure out a way to inventory and illustrate our data assets so that we can understand our data. Such an inventory could answer fundamental questions like:

- Where do these data assets physically reside (servers, database names/types, etc.)?
- Where does this data come from?
- Where is this data used and where does it go?
- How good is this data? Are there data quality issues and if so are they chronic?
- Who are the resident experts for this data?
- What did/does it cost to obtain and manage this data?
- Can we sell our data?
- ...insert a slew of additional questions here...

& last but not least...

- What data do we have?
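To make the list above a little more concrete, here is a rough sketch of what one entry in such an inventory might look like, with roughly one field per question. Everything in it is hypothetical: the `DataAsset` record, its field names, the `summarize` helper and the sample values are made up for illustration and don't come from any real tool or from my own catalog.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataAsset:
    """One entry in a hypothetical data asset inventory (all names are illustrative)."""
    name: str                                   # business-friendly name, not a table name
    location: str                               # where the asset physically resides
    sources: List[str] = field(default_factory=list)         # where the data comes from
    consumers: List[str] = field(default_factory=list)       # where it is used / where it goes
    quality_issues: List[str] = field(default_factory=list)  # known data quality problems
    experts: List[str] = field(default_factory=list)         # resident experts
    annual_cost: Optional[float] = None         # None means nobody has worked it out yet


def summarize(assets: List[DataAsset]) -> List[str]:
    """Return one plain-English line per asset, aimed at a non-technical reader."""
    lines = []
    for asset in assets:
        quality = "known quality issues" if asset.quality_issues else "no known quality issues"
        experts = ", ".join(asset.experts) if asset.experts else "none named"
        lines.append(f"{asset.name}: lives in {asset.location}; {quality}; experts: {experts}")
    return lines


# Sample values below are invented purely to show the shape of an entry.
inventory = [
    DataAsset(
        name="Customers",
        location="crm-db",
        sources=["web signup form"],
        consumers=["billing", "marketing reports"],
        quality_issues=["duplicate customer records"],
        experts=["Pat"],
    ),
]

for line in summarize(inventory):
    print(line)
```

This is still just a list with nicer labels, so it doesn't solve the "show it to a CEO" problem on its own, but it is one level above tables and fields.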

If anyone out there has had success beyond simple data dictionaries I'd love to hear about them.

Until next time...Rich

Comments

data quality chronicle said…

Great post and an even better challenge Rich! I relate because I often go through this challenge with clients when looking to de-duplicate customer data. Often there is not enough data to do it accurately, and most people (excluding some data geeks) are completely unaware.

I'll see if I can add to your data dictionary model and come up with some good ideas.

Jan 5, 2010, 3:49:00 PM

Thorsten said…

Strange how something so familiar to us "geeks" such as a database model can be completely "over the top" for executives.
We've tried something we called a "Data Map", basically a very high-level ER schema with all the company's data that fits on a single PowerPoint slide. It's part of something I want to discuss on my blog in the future, but I could send out a sample or do a rough post if anyone's interested.

Jan 5, 2010, 4:08:00 PM

James Standen said…

Great Post.

I think part of the solution is presenting the information in very visual ways. A picture really is worth a thousand data models.

I think you've put your finger on one of the key issues facing data people today.

Jan 5, 2010, 5:02:00 PM

paulboal said…

I've done things like what Thorsten describes, too. They're the most effective tool I've used. Unfortunately, they also have two common problems. Because they're disconnected from the reality of the details, they get out of sync with what data is really where. They have to be maintained separately and that often is too much "overhead" to keep up long term. Second, they often get tweaked into something that hides critical risks or issues in how the underlying data or systems actually work. Again, because they're disconnected from the real details, they end up representing a little bit of how someone wishes things worked rather than how they really work. Maybe not intentionally, but I've seen that happen plenty. Things are sometimes simply messier than we want them to be.

Abstraction is for hiding the details, not lying about them.

I wonder if there are any good ways to take the detailed metadata that we collect in ER models and generate a higher level picture. Maybe some kind of mind map or a word cloud that would help a savvy-but-not-technical executive understand more easily. Interesting experiment... watch for a blog post from me in the next few days...

Jan 5, 2010, 9:18:00 PM

Charles Blyth said…

Great post Rich,

The best way I have found to handle this question is to turn it around.

Respond with "It's not about the data we have, it's about the Information we have"

Talk about your data in the context of information and the business will understand it. If they don't, they probably never will...

Execs don't ask how many widgets they have for production of XYZ product, they ask what the production capacity is to produce XYZ product. Treat discussions about data in the same way and you will get their attention.

Cheers

Charles

Jan 6, 2010, 5:21:00 AM

Sheezaredhead said…

Great Post Rich! What we have done at our organization is we use a wiki to collaborate on and communicate this information. The wiki provides the following business value:
1/ the information that is important (such as definitions, rules, stakeholders, issues) is easily accessed and searchable
2/ it allows for both business and IT to collaborate on potential changes to the rules etc,
3/ it can easily be maintained and therefore the information is kept up-to-date.
4/ We can track the usage (or hits) each term receives which tells us which information the users are most in need of.
We know this provides value to our users because in 1 year the hits to this information have increased by 200%.
This strategy doesn't meet all the needs you identified but it has helped us communicate, engage and raise awareness of data quality.
Thanks very much

Jan 6, 2010, 7:59:00 AM

paulboal said…

Sheezaredhead - can you talk some more about what the adoption curve for your wiki looked like? Did you have champions in particular business areas that were part of the effort from the beginning; or did it start as a technical solution and slowly grow into a collaborative one? We've got a team exploring that and I'm concerned that they don't have any strong business champions to help make it sustainable.

Jan 6, 2010, 8:09:00 AM

Rich Murnane's Blog
