The "Mosaic Effect"...


A few months ago I had the privilege of sitting in on a presentation by an architect for the data.gov website. It was quite an interesting presentation, and I was very glad I took some time out of my schedule to see what they were doing, where they started and what was next. During the question and answer portion, almost every single question put to the presenter was about either the quality of the data available on the site or the security implications of making all this data available to anyone with an Internet connection.

The presenter’s answers to the data quality questions centered almost exclusively on the fact that the data on the site was provided by the individual agencies, and those agencies needed to be responsible for its quality. In short, data.gov was just making the data available; if you had a problem with the data, go see the folks who created it. That makes some sense, I guess: if they had gotten hung up on worrying about all the data content, they would never have gotten this far. I’m not sure I agree with it, but I guess it makes sense. I’ve mentioned in previous posts that I wished there was some way to provide feedback and, voilà, they’ve recently added the ability to “rate” the data sets in four categories (overall utility, data utility, usefulness and ease of access).

Answers to the security-related questions were certainly more thoughtful. The presenter noted that all the data sets available on data.gov were already readily available on the different U.S. Government agency websites; all data.gov was doing was making them much more easily accessible. It was clear, however, that security was on the presenter’s mind; the tone of the conversation became much more serious during this part of the discussion. What was the presenter’s primary security concern? It wasn’t that an agency was going to post a single data set which would compromise security; he didn’t think anyone was that foolish (this was pre-WikiLeaks). The primary concern was that someone could merge multiple data sets, piece them together if you will, to build a new data set which would in fact compromise security. The presenter called this scenario the “Mosaic Effect” and defined it as follows (paraphrased): “The Mosaic Effect is when seemingly innocent (the presenter used the word innocuous) bits of data, which by themselves are not a security concern, may reveal secure information when combined.”



In 2004, ComputerWorld defined the Mosaic Effect as (paraphrased): “How a combination of seemingly innocuous bits of data can create a privacy breach when combined.”

In a statement on the CIO.gov website in early 2010, Vivek Kundra stated: “Individual pieces of data when released independently may not reveal sensitive information but when combined, this “mosaic effect” could be used to derive personal information or information vital to national security.”
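
To make the idea concrete, here is a minimal sketch in Python of how two harmless-looking data sets can be pieced together. Everything in it is made up for illustration: the records, the field names and the build_mosaic helper are my own assumptions, not anything taken from data.gov or from the presentation.

```python
# Hypothetical, made-up records. Neither list is sensitive on its own:
# the first has home values but no names, the second has names but no finances.
property_records = [
    {"zip": "00001", "birth_year": 1970, "home_value": 450_000},
    {"zip": "00002", "birth_year": 1985, "home_value": 310_000},
]
club_roster = [
    {"name": "Pat Example", "zip": "00001", "birth_year": 1970, "hobby": "cycling"},
    {"name": "Sam Sample", "zip": "00002", "birth_year": 1985, "hobby": "chess"},
]


def build_mosaic(left, right, keys):
    """Join two lists of dicts on the fields they share (a simple hash join)."""
    # Index the left-hand rows by the values of the shared fields.
    index = {}
    for row in left:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    # For every right-hand row, attach any left-hand rows with the same key.
    merged = []
    for row in right:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged.append({**match, **row})
    return merged


# Combined, each name is now tied to a home value that neither list revealed alone.
mosaic = build_mosaic(property_records, club_roster, keys=("zip", "birth_year"))
for person in mosaic:
    print(person["name"], person["hobby"], person["home_value"])
```

The join key here (zip code plus birth year) is just an assumption about which fields two data sets might happen to share. The point is that no single field has to be a secret for the combined row to say more than either source did.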

The “Mosaic Effect” hit home recently when a friend sent us a link to an online phonebook which aggregates data about people and makes it available via an online search engine. There are probably dozens of sites which do this now, but this particular site provided the following bits of information under my name: address, approximate age, approximate household income, approximate home value, hobbies and other household members. YIKES... Apparently they knew more about me, which they were willing to share if I was willing to cough up $36 a year for full access to their database and search capabilities.

I’m not a privacy nut, but we’re moving into an age where just about any information you want to find out about a person can be found with an Internet browser. Data aggregation websites are exploiting this “Mosaic Effect”, pulling data together from social networks, online auctions, online real estate and tax databases and wherever else they can grab information about people. Like it or not, sooner or later you won’t have to ask someone “boxers or briefs”; you’ll pay $36 to some random data aggregator and you’ll find out the person’s waist size too.

Until next time...Rich

Comments

Unknown said…
Rich
Great post. It is worth pointing out that in Europe, the legislation re: Personal Data Protection defines "Personal Data" as being anything which can identify an individual or which, if combined with other data likely to come into your possession, would identify an individual.

So the Mosaic effect was anticipated with that legal framing.

Of course, people still screw up all the time!
Anonymous said…
Rich,

An excellent post which touches on many areas/prompts more thoughts:
* Daragh correctly points out the data protection legislation already provides some mechanisms to overcome the mosaic effect
* If there are DQ problems with individual tiles in the mosaic, then you may end up with a picture that is not correct, but may not obviously be so
* If common identifiers are not used consistently, then it is possible that the mosaic ends up being a collection of small mosaics which do not clearly join up
* Publication of large volumes of such data can be misleading due to what is left out (as it may not meet thresholds for significance, or may be too sensitive). The UK government recently published large volumes of expenditure data for government departments; however, analysis of this will be difficult due to the complexity of the overall data sets and a lack of clarity on the completeness and accuracy of the data.
* As Henrik Sorenson mentioned in his tweet of your post, it looks to Europeans (and others) as if your post was written on the 1st August 2011, so may have involved some time travel.

It makes me wonder what the world would be like if all this data could be linked up and made meaningful - governments would have an even tougher time behaving the way they currently do and it would make the current Wikileaks disclosures seem like a storm in a teacup by comparison.

Keep up the good work


Julian
Murnane said…
Hey guys, I changed the date format!