Deanonymizing Metadata

I’ve written a couple of times about the New York City Taxi Commission’s metadata and how easily it can be abused even though it was anonymized by removing the personally identifiable information. The anonymization notwithstanding, data analysts were able to recover a plethora of personal information attributable to specific individuals. You can read the two posts above to see how easy this was.

Now the New York Times is reporting on the problems of anonymous metadata. They report on a study published in Science showing that anonymous credit card metadata could be deanonymized over 90% of the time if the analyst had four pieces of outside information about a person in the anonymized metadata. Even social scientists, who relish this type of data and use it in their research, are concerned about the privacy issues that it raises.
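To see why a handful of outside facts can be enough, here is a minimal sketch in Python. It is not the method used in the Science study; it is just a toy illustration, and the records, the pseudonyms, and the function names (build_traces, candidates, unicity) are all made up for the example. Each record is a pseudonym plus a shop and a day, and the question is how many pseudonyms remain possible once you know a few (shop, day) points about someone.

```python
import random
from collections import defaultdict


def build_traces(records):
    """Group (pseudonym, shop, day) records into a set of (shop, day) points per pseudonym."""
    traces = defaultdict(set)
    for pseudonym, shop, day in records:
        traces[pseudonym].add((shop, day))
    return dict(traces)


def candidates(traces, known_points):
    """Return the pseudonyms whose trace contains every externally known point."""
    known = set(known_points)
    return {p for p, trace in traces.items() if known <= trace}


def unicity(traces, k, seed=0):
    """Fraction of pseudonyms singled out by k points sampled from their own trace."""
    rng = random.Random(seed)
    eligible = [p for p, trace in traces.items() if len(trace) >= k]
    if not eligible:
        return 0.0
    unique = 0
    for p in eligible:
        sample = rng.sample(sorted(traces[p]), k)
        if candidates(traces, sample) == {p}:
            unique += 1
    return unique / len(eligible)


# Invented toy data: no names, only pseudonyms, shops, and day numbers.
records = [
    ("u1", "bakery", 1), ("u1", "cafe", 2), ("u1", "gym", 3), ("u1", "bookshop", 4),
    ("u2", "bakery", 1), ("u2", "cafe", 2), ("u2", "cinema", 3), ("u2", "grocer", 5),
    ("u3", "bakery", 1), ("u3", "pharmacy", 2), ("u3", "gym", 3), ("u3", "grocer", 5),
]
traces = build_traces(records)

# Each extra known point shrinks the set of possible pseudonyms.
one_point = candidates(traces, [("bakery", 1)])
two_points = candidates(traces, [("bakery", 1), ("cafe", 2)])
three_points = candidates(traces, [("bakery", 1), ("cafe", 2), ("gym", 3)])
print(len(one_point), len(two_points), len(three_points))
```

In this toy data, one known purchase leaves three candidates, two leave two, and three leave exactly one. Once a single pseudonym remains, every other record under that pseudonym belongs to the person you already know, which is the whole problem.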

Much of this data is collected as a side effect of ordinary business practices. For example, credit card data has to be collected so that the credit card user can be billed and the merchant reimbursed. There is, however, no reason to release this data in any form, no matter how much social scientists might wish to study it or advertisers wish to leverage it for targeted advertising. The data belongs to the customers and no one else. Nothing compels companies to make this sort of data available and it should, in fact, be illegal to do so.

The situation with data collected by the government is more complex. It is almost always subject to FOIA requests and therefore subject to release no matter what the holders or subjects of the data might prefer. That’s what happened with the NYC Taxi Commission data: someone filed a FOIA request for the data and then promptly deanonymized it. It’s certainly the case that the government must collect some data. Income tax returns, for example, are full of sensitive personal information. That’s why they’re specifically exempted from FOIA and why it’s illegal to release the data to anyone.

Sadly, though, the government collects lots of data that (a) is not protected from FOIA and (b) probably doesn’t need to be collected. Much of the data is collected simply because technology makes it cheap and easy to do so and because it might, someday, be useful. As the Science report makes clear, there are no effective standards in place to guarantee the safety of data, so in the absence of a compelling reason to collect it, it shouldn’t be collected in the first place.

Of course, lots of people want to get their hands on that data, and almost never for the benefit of the people it was collected from. For that reason it almost certainly will continue to be collected. The people who want it have deep pockets and will ensure the flow isn’t shut off. That’s too bad for the rest of us.
