Are we ready for all this data?

As the year winds down I've been thinking about what has changed for us data geeks over the last year and what will continue to change over the next year.

I'm pretty sure the #1 thing that's changing for us folks working with data is the amount of data we're looking at. Some news stories today make it very apparent that we're probably not quite ready for all this data.

In the first article, a kid from California downloaded 1.4 million kilobytes of data to his cell phone, and his father's response was "They shouldn't allow this to happen". You know what, the father's right. Clearly Verizon doesn't have measures in place to ensure folks aren't downloading way above "normal" amounts, and as a result "Joe the plumber" is getting a cell phone bill for $20k. Why does Verizon let this happen? The answer is simple: they're not "ready" for the explosion of data they're experiencing.

Next in the news we have lots of press illustrating just how ill-prepared AT&T is for the bandwidth their iPhone users need. I'd imagine that behind the scenes AT&T is not only concerned about the bandwidth issue; they are more than likely also trying to recover the cost of managing all the data generated by those downloads. Imagine if you will that each phone call, text message, email, website visit, and document/image/video download generates records somewhere in log files and databases so that folks can be billed appropriately (see the Verizon comments above). With the addition of the iPhone, AT&T's communication bandwidth requirements grew by 40%, and I'd imagine their back-office data requirements grew at a significantly larger rate than that. Were they "ready"? Sounds like they weren't, and that's why they are scrambling now to recover costs. "Limiting" iPhone bandwidth will do nothing but push Apple toward other providers like Verizon. Think they're ready?

This next posting reiterates that even Jim Gray was concerned about the amount of data we're going to need to manage. If a brilliant mind like Jim Gray was concerned about the data explosion, then the rest of us mortals should be scared half to death.

At the DataFlux Ideas show in October, Tony Fisher said something along the lines of data doubling every three days. I could be misquoting him here, but I think you get the picture. Clearly we need to be ready for significant growth in data demands.

Sometime in 2004 I was managing a database that was growing very quickly. When I say quickly, I mean that we were loading tables with records at a rate of up to a million a day. At the time I reached out to the vendor to see if there was anything "special" I should do from a DBA perspective, and the answer I received wasn't very helpful. Either they weren't concerned or they "weren't ready" for questions like this. I wasn't really ready myself either. I tried implementing a few tricks that helped a little, but I can't help thinking we were ill-prepared and that the applications and users who needed the data suffered for it.

At this point we need to assume all our databases and other data sources are going to experience explosive data growth, if they haven't started feeling the pain already. The most significant challenge is that expectations for data quality will remain the same even as the amount of data explodes. I'm contracting with a client who loads a database with approximately 30 million transactional records a week. Interestingly enough, the client is extremely concerned about the 100 or so transactions that hit exceptions and are not loaded. One hundred exceptions in a load of 30 million transactions is a "five nines" success rate (99.99967%). Communications networks and computer hardware with service level agreements (SLAs) at that level are touted as extremely reliable and successful. The perception for data, however, is that it needs even better reliability rates than other traditional IT systems.
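For anyone who wants to check that arithmetic, here's a quick back-of-the-envelope sketch using the approximate figures above (30 million records, about 100 exceptions); the numbers are illustrative, not exact:

```python
import math

# Approximate figures from the example above.
records_loaded_per_week = 30_000_000
exceptions = 100

# Fraction of records that load successfully.
success_rate = 1 - exceptions / records_loaded_per_week
print(f"Success rate: {success_rate:.5%}")  # ~99.99967%

# Express the failure rate as "number of nines" of reliability.
nines = -math.log10(exceptions / records_loaded_per_week)
print(f"Roughly {nines:.1f} nines of reliability")  # ~5.5 nines
```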

It's unfortunate that when we provide a detailed report to clients and they find a single error in it, they question the integrity of the entire report. We can't really blame our clients though, can we? If you received a $20k cell phone bill you'd be upset too.

Data quality needs to be addressed up front, at the start of any data project, and the project needs contingencies in place to prepare for explosive data growth. One error among a thousand records or a billion records causes the same grief and concern; it's just human nature.

The better the people, processes and tools we have in place from the start of a project, the better our data is going to be, because we'll be ready. For projects already under way, each new iteration should have some time baked in not only for SQA testing but also for data quality tasks. We're in for a good fight, and I say "Let's get ready to rumble!"

I'm going to be essentially "offline" for the next two weeks for the holidays. To each of you out there who might stumble across this I wish you the best in 2010.

Until next year...Rich
