One of the tough problems in ethical data gathering is how to collect statistics while respecting the privacy of those the data is being gathered from. It’s widely known that most forms of data anonymization are not robust and that it takes surprisingly little additional information to deanonymize the data.
One method that does show promise is differential privacy. The basic idea is that the data is perturbed in such a way that its origin cannot be reconstructed but such that statistical measures, such as mean values, can still be estimated. That allows the collection of aggregate data while protecting the individual targets of the collection.
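The oldest and simplest example of the idea is classic randomized response, which predates the formal definition of differential privacy. Here’s a minimal sketch in Python (entirely my own illustration, not taken from any of the posts mentioned below): each respondent to a sensitive yes/no question flips a coin and sometimes answers at random, yet the true proportion can still be recovered in aggregate.

```python
import random

def randomized_response(truth):
    """On heads, report the true answer; on tails, report a fresh coin flip."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_proportion(reports):
    """Invert the noise: P(report True) = 0.5*q + 0.25, so q = 2*p_hat - 0.5."""
    p_hat = sum(reports) / len(reports)
    return 2.0 * p_hat - 0.5

# 100,000 simulated respondents, 30% of whom hold the sensitive attribute.
truths = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t) for t in truths]
print(estimate_proportion(reports))  # close to 0.3
```

No individual answer can be taken at face value, because any respondent may have been reporting a coin flip, but the noise averages out over the population.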
Recently John Cook’s excellent Data Privacy Twitter feed had a pointer to an interesting post on the Microsoft Research Blog on how they collect user telemetry anonymously using differential privacy techniques. The idea, they say, is to encode the data on the user’s device in such a way that the output of the encoding is approximately equally likely to have come from any other user. At the same time, the encoding allows the recovery of useful aggregate data.
The post describes a method of encoding a value \(x \in [0, M]\) in a single bit in such a way that the mean can still be estimated. It’s a nice trick, and the post is worth a read just to see how effective differential privacy can be.
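Out of curiosity, here’s a sketch of one standard way to build such a one-bit mechanism. It matches my understanding of the randomized-response-style construction in the research behind the post, but the names, the \(\epsilon\) privacy parameter, and the simulation are mine, so treat it as an illustration rather than Microsoft’s actual implementation:

```python
import math
import random

def encode_one_bit(x, M, epsilon):
    """Encode x in [0, M] as a single privatized bit.

    The bit is 1 with probability
        1/(e^eps + 1) + (x/M) * (e^eps - 1)/(e^eps + 1),
    so any individual bit could plausibly have come from any value of x.
    """
    e = math.exp(epsilon)
    p = 1.0 / (e + 1.0) + (x / M) * (e - 1.0) / (e + 1.0)
    return 1 if random.random() < p else 0

def estimate_mean(bits, M, epsilon):
    """Unbiased estimate of the mean of the original values from the bits."""
    e = math.exp(epsilon)
    n = len(bits)
    return (M / n) * sum((b * (e + 1.0) - 1.0) / (e - 1.0) for b in bits)

# Quick check: 100,000 simulated users with values uniform on [0, 100].
M, epsilon = 100.0, 1.0
values = [random.uniform(0, M) for _ in range(100_000)]
bits = [encode_one_bit(x, M, epsilon) for x in values]
print(estimate_mean(bits, M, epsilon))  # close to the true mean of ~50
```

Each bit on its own reveals almost nothing about its sender, but because the encoding is unbiased after rescaling, the estimate converges to the true mean as the number of users grows.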