The Problems with Gratuitously Collecting Data

Over at Neustar Research, Anthony Tockar has an interesting if terrifying post on the analysis of anonymized data from New York City's Taxi and Limousine Commission. The data, which was obtained through a FOIA request, contains details of every taxi or limousine ride in the city for 2013. The details included the pickup and dropoff times and locations, the fare, the tip, and an anonymized identifier obtained by hashing the taxi's license and medallion numbers. The first thing that happened was that a hacker used a little thought and completely de-anonymized the data. Once you know how the anonymizing ID was calculated, it's easy to see how this could be done: there are only so many possible license and medallion numbers, so you can hash every one of them and look up each "anonymous" ID in the results.
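To make that concrete, here is a minimal sketch of the attack. It is an illustration under stated assumptions, not the method actually used: the identifier format (digit, letter, digit, digit) and the choice of SHA-256 are stand-ins picked for the example, since nothing above says which hash or format was involved. The point is only that an unsalted hash over a small set of possible inputs can be reversed by hashing every candidate.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def anonymize(identifier: str) -> str:
    """Stand-in for the released 'anonymous' ID: an unsalted hash.

    SHA-256 is an assumption made for this sketch.
    """
    return hashlib.sha256(identifier.encode("ascii")).hexdigest()


def build_reverse_table() -> dict:
    """Hash every possible identifier and map each hash back to its input.

    Assumes a hypothetical format of digit, letter, digit, digit,
    which gives only 10 * 26 * 10 * 10 = 26,000 candidates.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[anonymize(candidate)] = candidate
    return table


reverse_table = build_reverse_table()

# An "anonymized" ID as it might appear in a released record.
released_id = anonymize("5X41")

# A single dictionary lookup recovers the original identifier.
recovered = reverse_table[released_id]
```

Building the whole table takes a fraction of a second on an ordinary laptop, which is why hashing alone does nothing for anonymity when the inputs are this predictable.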

Right away you know the annual earnings of the city's cab drivers, but the massive privacy fail doesn't end there. Tockar, who's a graduate student in Data Science interning at Neustar, was able to tease an unbelievable amount of information from the data. For example, he was able to identify the name, address, property value, relationship status, court records, and a profile picture of an individual who had been frequenting certain men's clubs. All this information was easy to obtain once Tockar had linked the clubs and a residential address. He was also able to trace the comings and goings of some celebrities in the city. Read his post for the details and how he was able to leverage the data to reveal all sorts of private information.

Tockar's solution to this is something called differential privacy. It's analogous to the selective availability that the U.S. government once used to degrade the accuracy of the (commercial) GPS signal. The idea is that the coordinates of the locations are perturbed by random noise so that the locations cannot be tied to individuals.
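As a rough sketch of what perturbing coordinates with random noise might look like, the snippet below adds Laplace-distributed noise to a latitude/longitude pair. The function names, the noise scale, and the sample point are all assumptions made for illustration. This is not Tockar's procedure, and it is not a complete differential privacy mechanism, which would also have to tie the noise scale to a privacy budget.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-centred Laplace distribution.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(lat: float, lon: float, scale_deg: float = 0.002, rng=None):
    """Return (lat, lon) with independent Laplace noise added to each.

    scale_deg is a hypothetical value: 0.002 degrees is roughly
    220 metres of latitude.
    """
    rng = rng or random.Random()
    return (lat + laplace_noise(scale_deg, rng),
            lon + laplace_noise(scale_deg, rng))


# An arbitrary point in Manhattan, used purely as an example.
rng = random.Random(2013)
original = (40.7580, -73.9855)
noisy = perturb(*original, rng=rng)
```

A larger scale hides individual pickup and dropoff points better but makes the data less useful for aggregate analysis, and that trade-off is the crux of any such scheme.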

To me, this misses the point. The problem with this sort of data is that someone will always abuse it. Even if the data hadn't been released, it was still available to city authorities who could use it to track individuals or otherwise spy on citizens' activities—probably without a warrant. Imagine a jealous spouse or partner, for example, who worked at the commission, had access to the data, and used it to track the object of his or her jealousy.

The only real solution to the problem is to not collect the data to begin with. Did the city have any real reason for collecting it other than that it could? If asked, they'd be sure to bring up the usual four horsemen, but the relevant information is kept by the drivers in their trip logs and can be obtained from them in the rare cases that it's needed. Centralizing the data and subjecting it to FOIA requests is just asking for trouble. Trouble that, as Tockar demonstrates, is easy to find.
