Wednesday, November 25, 2009

A data geek's list of things he's thankful for...

With Thanksgiving upon us here in the States tomorrow, I thought I'd list out a few of the data-related things I'm thankful for:


My team: Enterprise Data Operations Team
Both my direct reports and the rest of the team here at my shop are fantastic. My company's website lists my immediate team's credentials as follows: "Managing data from more than 2,000 sources and creating reports essential for business decision requires laser-focused attention to detail, commitment to data quality and unwavering dedication to data privacy. Our enterprise data operations team represents more than 85 years of data management excellence, with in-depth knowledge in a variety of database systems across various industries. This training and experience enables our team to integrate data from various sources into a centralized database of record, ensuring that clients’ data is accurate and accessible when they need it the most."

My "online friends"
New to me this year are the relationships I've built with some kindred spirits in the online "data world". I've found that following these folks on Twitter and reading their blogs has changed the way I work. Finding inspiration from folks who are working on very similar problems has perhaps even brought some of the fun back into this crazy niche world of data I find myself in each day.

My "going against my inner introvert self"
I've really made an attempt in the last year or so to branch out socially and I'm very thankful for it. More details are:

  • I'm thankful for the #DataQuality and #DataGovernance tags on Twitter.
  • I'm thankful for keeping in touch with my professional friends on LinkedIn and personal friends on Facebook (as well as the ability to separate the two).
  • I'm thankful I was able to attend both a DataFlux training class and the DataFlux Ideas conference in 2009.

    My "technology stack"
  • I'm very thankful and lucky with my primary organization's (and my primary contracting organization's) technology stack. I've been working on Oracle/Linux for a long time now and knowing the ins and outs of the technology has been extremely helpful.
  • I'm extremely thankful for the data quality tool we're using (DataFlux) and have been very impressed with it since we purchased it over a year ago.
  • I'm very thankful and happy with this silly little open source ETL tool called Kettle (Pentaho's Data Integrator).
  • I still find myself dumping data down to MS Excel more often than I care to admit, but I have to admit that I am thankful for the ability to do so. At the end of the day, as long as I don't share these files with anyone, I might as well use it for what it's good at.
  • I'm thankful for the ability to whip up a quick website using PHP.
  • I'm thankful for Perl and regular expressions (well, the few things I know).
  • I'm thankful for sites like askTom, because even when you think you know what you need there are days when you're humbled by folks like Tom who make you feel like you're a newbie.
  • I'm thankful for Firefox, which has made all our lives better by getting us out of the IE funk we were in for so long.
  • I'm thankful for ERWin - even though I'm using such an old version.
  • I'm thankful for VPNs - without which I'd be stuck in the office every single day and night (very sad but true).
  • I'm thankful for Cygwin - makes my life easier when running Windoz
  • I'm thankful for a new mindmapping tool I started using this year, XMind - I'd recommend trying it out (free!).
  • I'm thankful for bit.ly - silly but I love it.
  • I'm thankful for version control - if you're not using something like this for your work, you're crazy.
  • I'm thankful for my text editor (an old version of InterDev from Microsoft)
  • I'm thankful for StumbleUpon - it's the best way to procrastinate.

    Second to last I'm thankful for this silly little blog. It's fun posting things of interest out here, seeing "what sticks" and how I may have helped or intrigued other folks.

    Lastly of course I'm thankful for a great family, my health and the health of my wonderful family.

    Until next time...Rich

    Friday, November 20, 2009

    Unless you're perfect, expect some giggles when opening your (data) kimono...

    <disclaimer>I'm going to try to leave the politics out of this one, but I might cross that fine line - consider yourself warned...</disclaimer>

    Lots of press today about the "Data Quality" issues found on recovery.gov.

  • Obama Administration Defends Its Data Quality
  • which references a blog post on the White House's Web site Looking at the Big Picture on the Recovery Act
  • another: Some don't report how stimulus funds spent

    The list unfortunately could go on and on and on...

    With all this said, I can't help but think to myself:
    "umm, what did they think was going to happen?"

    Increased visibility to your data allows people to find more data quality issues with your data. It's a very simple concept. I learned this lesson many years ago when building applications as a junior software developer for a large organization. We created this enormous database containing information about products in our industry. Things were going along fine and we were fat/dumb/happy with ourselves UNTIL PEOPLE STARTED LOOKING AT IT! Each and every time you expose more data to more and more people, the sheer number of questions about that data is going to skyrocket.

    "more data + more visibility = more questions"

    Every single time I've been involved in a project to put data into the hands of the masses we've gone into the project knowing that we were going to get some folks giggling at our "open kimono".

    It's a very difficult position to be in, and I kind of feel bad for these folks (just a little bit). In Mr. DeSeve's posting he implores people to look at the big picture. This must be this guy's first time building anything like this, because anyone and everyone in the data world knows that if you build a really slick "look and feel" reporting application but people can't trust the data, they will not want to use the application.

    People will not and can not get past obvious data quality issues in reports. Any junior data analyst in the industry you ask should know this.

    With all that said, I'm shying away from stating that the program was or was not a good idea. I do in fact however think that the government was premature in posting this data to people without any quality assurance done on the data. The fact that someone could type in an incorrect Congressional district in this day and age (it's called referential integrity and most databases have had it for over a decade) is inexcusable. The fact that the government posted data with incorrect or missing districts is inexcusable. The fact that they can't tell us who has reported and who hasn't is also significantly concerning.
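    For anyone who hasn't seen referential integrity in action, here's a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are all made up for illustration (I have no idea what the recovery.gov schema looks like); the point is only that a foreign key lets the database itself refuse a district that doesn't exist.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a district lookup table.

    All table and column names are hypothetical, for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless you ask for it.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """CREATE TABLE congressional_district (
               state    TEXT    NOT NULL,
               district INTEGER NOT NULL,
               PRIMARY KEY (state, district))"""
    )
    conn.execute(
        """CREATE TABLE award_report (
               report_id INTEGER PRIMARY KEY,
               recipient TEXT    NOT NULL,
               state     TEXT    NOT NULL,
               district  INTEGER NOT NULL,
               FOREIGN KEY (state, district)
                   REFERENCES congressional_district (state, district))"""
    )
    # Sample lookup rows: five districts for one state.
    conn.executemany(
        "INSERT INTO congressional_district VALUES (?, ?)",
        [("CT", d) for d in range(1, 6)],
    )
    return conn


def try_insert(conn, recipient, state, district):
    """Return True if the report was accepted, False if the database rejected it."""
    try:
        conn.execute(
            "INSERT INTO award_report (recipient, state, district) VALUES (?, ?, ?)",
            (recipient, state, district),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised when (state, district) is not in the lookup table.
        return False
```

    With this in place, a report for district 3 goes in and a report for a non-existent district 15 is turned away at the door, before anyone outside ever sees it.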

    The #'s wouldn't have been as glamorous, but why not call attention to those folks who have not reported or who have reported incorrectly?

    At the end of the day the website had great intentions (show me my data) but will forever be associated with poor data. Here are some quotes that we've all heard variations of before:
  • Many of the mistakes "don't undermine information at the heart of the data"
  • "the mistakes are RELATIVELY few, and don’t change the fundamental conclusions one can draw from the data."
  • "Some of the mistakes are frustrating typos and coding errors that don’t undermine information at the heart of the data"

    If anyone out there is interested in building a reporting application like this for an organization (large or small), be prepared to be giggled at when you open that kimono - unless, that is, you're "perfect".

    Until next time...Rich

    Wednesday, November 18, 2009

    Single Version of the Truth? An online battle of wit and knowledge sharing for those interested in Data Quality (who wouldn't be interested?)...

    In the spirit of good fun and knowledge sharing, three of my favorite Data Quality WebCelebs (I thought perhaps I just coined a new term, but after googling I realized I'm not quite that clever) have decided to have a battle of the pen (well, keyboard) to determine who could write a better blog entry on the "Single version of the Truth".

    First off is Henrik L Sørensen's entry titled "Sharing data is key to a single version of the truth". This short-but-sweet entry uses a fantastic map analogy to back up what can be best summed up by his thesis sentence: "there is a break even point when including more and more purposes where it will be less cumbersome to reflect the real world object rather than trying to align all known purposes.". He's got a very valid point, and we as practitioners of Data Management really need to think through the "how far should we go" concept. Very, very thought-provoking.

    Second out of the gate we find Charles Blyth out of the UK posting a very witty entry (with very nice icons - what else should we expect from someone out of the UK?) titled "Tell me the Truth!". The part I like most about Charles' posting is that it gives practitioners hints as to how they might approach this (i.e. asking questions such as "Is there any ambiguity in this definition?"). If you're getting into an MDM project and need to look out to the marketplace to ask questions or find someone who is succeeding in delivering on this holy grail, I'd strongly recommend reading Charles' posting and then reading his older ones as well.

    Last but not least we have a posting called "Beyond a “Single Version of the Truth”" by Jim Harris - proprietor of the OCDQBlog. Typically I despise when folks "name drop", but Jim Harris wasn't going down without a fight on this one and references Einstein, Obi-Wan, Mad Max, Thomas Redman - VERY impressive. In all seriousness, I think Redman hit the nail on the head (as does Jim by culling it out of the book) when stating "A fiendishly attractive concept is...'a single version of the truth'...the logic is compelling...unfortunately, there is no single version of the truth. For all important data, there are...too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. This does not imply malfeasance on anyone's part; it is simply a fact of life." After quoting Redman, Jim goes on to discuss getting beyond the "single version of truth" (by using a Mad-Max analogy).

    As a practitioner of all things data (architecture/design/development, administration, governance and quality, etc.) I enjoyed this wonderful challenge very much. I found Henrik Sørensen's posting to very much stimulate deep thought on the topic. Charles Blyth's posting gave me the feeling that a Single Version of the Truth can really be accomplished, if only because folks like Charles have really done it and are getting it done today with today's tools and processes that are well documented.

    The "getting beyond" portion of Jim's posting however is probably the part that sold me on Jim Harris' posting as the winner of this fantastic battle royal. As practitioners, we have to get beyond the "single version of the truth". I'm not confident that folks outside our Data Geek Club really understand this slogan, and it's one of our primary "marketing" terms. If we focus on the "what's next" after the single version of the truth, such as the benefits of "better" data, I'm confident this industry is going to take off like a rocket.

    Thanks again to the three of you for taking your time to share your knowledge with us, we really do appreciate it.

    Until next time...Rich

    p.s. After voting, here are the current counts

    Friday, November 06, 2009

    Choices you have when receiving data that has data quality issues

    Found (via Twitter #DataQuality hashtag) a couple of good postings on a blog called "Data and Process Advantage Blog" - here are some thoughts on the postings.

    1) Are you a #DataQuality ostrich? Do you know an ostrich?
    I enjoyed this entry and I do have a "small team assessing the completeness and validity of data using data profiling tools" but I'm hopeful we're not "avoiding doing data accuracy checking as it is seen as difficult".

    2) An earlier posting named "Approaches to data quality issues from suppliers" really caught my attention, because my shop does in fact get data from many, many different data sources, and each day we have to make very difficult decisions on what to do when we receive data with data quality issues. I may have posted this image before, but I'm a firm believer that you have four choices when receiving data with data quality issues:

    Just like everything else in life, one size solutions don't fit all.

    In the blog entry, it sounds like the first organization followed the third approach. I'd guess however that they didn't follow through and keep track of changes as well as report those corrections back to their suppliers and to their clients or senior management. Had they done this they could have marketed the work they were doing as "value add" work as well as - well - just making the data better for the applications which use the data.

    The second organization followed the path which I find most folks follow - they rejected the data. One major difference - they "did not state the nature of the errors" - very bold move here. Most folks go round after round with their data suppliers when rejecting data. Round one typically looks like "fields B and C are bad". Round two typically looks something like "how can field D have this value if field M has that value". Round three typically looks like "are we receiving all the data we're supposed to be getting?". Not telling the data supplier where the problems are feels weird to me. Imagine filling out an online form for ordering a product and when you hit submit it tells you "something's wrong, go find the error Bozo"? I guess that would work if I was paying the person to fill out the form. I'm going to have to try this out some time.
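    To make those three rounds a bit more concrete, here's a rough sketch of a rejection report that does state the nature of the errors. The field names (zip, status, closed_date) and the rules themselves are invented for the example; a real feed would have its own.

```python
def validate_batch(records, expected_count):
    """Return a list of human-readable problems found in a supplier batch.

    Field names and rules are hypothetical, for illustration only.
    """
    problems = []
    for row_number, rec in enumerate(records, start=1):
        # Round one: a single field is bad on its own.
        zip_code = str(rec.get("zip") or "")
        if not zip_code.isdigit():
            problems.append(
                f"row {row_number}: field 'zip' is not numeric: {zip_code!r}"
            )
        # Round two: two fields contradict each other.
        if rec.get("status") == "closed" and not rec.get("closed_date"):
            problems.append(
                f"row {row_number}: status is 'closed' but 'closed_date' is empty"
            )
    # Round three: did we receive everything we were supposed to get?
    if len(records) != expected_count:
        problems.append(
            f"expected {expected_count} rows, received {len(records)}"
        )
    return problems
```

    An empty list means the batch passes these particular checks; anything else can be sent straight back to the supplier in one round instead of three.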

    If you're pulling in data from data suppliers and you can't just plain reject the data, my thinking is that it's fair game to "add value" by correcting or applying default values. One note however - you MUST in fact keep track of the changes you've made so that everyone involved knows what was originally sent in and what data has "value added" from your team. Keeping track of such changes may or may not mitigate the risk of the "value add"; I suppose this would have to be evaluated on a case-by-case basis.
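    Here's one way the "keep track of the changes" part might look, as a small sketch rather than how any particular tool does it. The ChangeLog class, the apply_defaults function and the record layout (a dict with an "id" key) are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Hypothetical audit trail of every value we changed."""
    entries: list = field(default_factory=list)

    def record(self, record_id, field_name, original, corrected, reason):
        self.entries.append(
            {
                "record_id": record_id,
                "field": field_name,
                "original": original,
                "corrected": corrected,
                "reason": reason,
            }
        )


def apply_defaults(records, defaults, log):
    """Fill missing values with defaults and log each change.

    Returns corrected copies; the supplier's original records are left untouched.
    """
    corrected = []
    for rec in records:
        fixed = dict(rec)
        for field_name, default in defaults.items():
            value = fixed.get(field_name)
            if value is None or str(value).strip() == "":
                log.record(rec["id"], field_name, value, default,
                           "missing value defaulted")
                fixed[field_name] = default
        corrected.append(fixed)
    return corrected
```

    Because the log keeps both the original and the corrected value, you can report the corrections back to the supplier and still show anyone downstream exactly which values came from your team.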

    As for choice three in my picture, I wish I could leave it out, but it is a valid choice. Just leave it alone and load it up into a database. Again, you should be able to easily identify what data actually has data issues.

    If there are significant liability issues at stake for ensuring your data is accurate, you may not want to go down the path of correcting the data, or you may want to try to add value. Either way, if you do in fact load it somewhere, please keep track of any changes you've made. You won't regret it.

    I'm looking forward to more posts on the "Data and Process Advantage Blog".

    Until next time...Rich

    Thursday, November 05, 2009

    Open Source BI/Reporting - two years later

    Today I received an email in my inbox from Pentaho and Accenture's Innovation Center for Open Source. The email was titled "Let Accenture show you how to save big on BI, with Pentaho". Here's a link if you're interested: http://www.pentaho.com/events/20091027_Accenture_webcast/

    ...Pentaho, I think I've written about them before...

    After receiving the email and skimming through the webcast, I sent an email to the team here at my shop, containing something similar to the following.

    "Dear Team

    Close to two years ago we took a look around the open source reporting industry to determine which platform we should standardize on for our reports in the portal environment. At the time we chose Jasper reports instead of Pentaho as it looked like a leader, creating reports using iReports (desktop client tool) was similar to tools like Crystal/Actuate reports, and a few software development folks have had experience with deploying Jasper Reports in Java applications.

    With all that said I'm continually bombarded with emails like the below where folks are seriously looking at or investing in Pentaho. I am also seeing some industry traffic on BIRT, particularly with it tying into Actuate. I'm not seeing any industry traffic on Jasper.

    I took a 10 minute test drive on Pentaho's online BI demo and it's very impressive. I believe this used to be back ended by a JBOSS portal but I'm not sure if it is anymore. The following link illustrates the differences between the Community edition of Pentaho BI and their Enterprise Edition version: http://www.pentaho.com/docs/pentaho_bi_suite_enterprise_edition.pdf

    ....some other proprietary comments here....

    Rich
    "

    At my shop we've been using Pentaho Data Integrator for a couple of years now and it's a very neat "free" tool. I checked Google Trends to see whether I was crazy for thinking Pentaho is really becoming more popular, and sure enough it looks true (see chart below).

    Until next time...Rich

    Friday, October 23, 2009

    MS Access useful queries

    Quite a long time ago I created a list of useful queries which you could execute in MS Access. These queries are particularly useful when dissecting one of these guys to move to a production platform (which I find myself doing today).

    http://www.sqlquery.com/Microsoft_Access_useful_queries.html

    Around the same time I created a MindMap of some best practices for building MS Access db's. This might also be useful for some of us data geeks out there.

    http://www.sqlquery.com/Microsoft_Access_Best_Practices.pdf

    Until next time...Rich

    Wednesday, October 21, 2009

    Data Quality mentioned in a Gartner keynote?

    I started following Gartner's data guy - Ted Friedman on Twitter recently and yesterday he added the following tweet:

    "#dataquality directly mentioned in #gartnersym opening keynote -- survey show 75% of finance execs consider DQ an issue."

    I wonder how we could get our hands on that survey without paying crazy dollars to Gartner. What are the other issues on their minds, is this the most important, second, third?

    Perhaps we are in fact at a "tipping point" in the data quality industry. Maybe data quality is (rightfully) one of Gartner's buzz words this year? Taking a look at the job boards today tells us that there are in fact 670 jobs posted on Monster.com today which mention "data quality" and only 147 that mention "enterprise architect". If anyone out there reads this posting they are probably the type of person who knows that Gartner really pushed "enterprise architecture" and architects over the past few years, maybe we are in fact seeing a shift in focus?

    Here's a pic of some of the keywords I'm going to start tracking on Monster and Dice. Too bad I didn't think of this earlier, as I'm just fascinated by trends, particularly those that matter to our profession.



    Until next time...Rich