In a nice followup to yesterday’s post about differential privacy, John Cook, the proprietor of the Data Privacy Twitter feed, has a post that uses Bayes’ Theorem to implement a simple differential privacy scheme.
The problem is to gather yes/no answers to a question from a group of people without revealing any individual’s “true” answer. What makes the post appealing is that the method is a simple application of Bayes’ theorem, so the mathematics is accessible to most Irreal readers. The protocol has each respondent flip a coin once or twice to determine their answer; see Cook’s post for the details, and the sketch below for the general idea.
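For readers who want to experiment, here’s a minimal sketch in Python of the standard randomized-response variant of the protocol, which I’m assuming matches the one Cook describes: each respondent flips a coin; on heads they answer truthfully, and on tails a second flip supplies a random yes or no. Because the distortion is known exactly, the true proportion of yes answers can be recovered in aggregate, while no single response can be taken at face value.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """One respondent's answer under the two-coin protocol."""
    if random.random() < 0.5:        # first flip: heads -> answer truthfully
        return true_answer
    return random.random() < 0.5     # tails -> second flip gives a random answer

def estimate_true_fraction(responses: list[bool]) -> float:
    """Invert the protocol: P(observed yes) = 0.5 * p + 0.25,
    so p = 2 * P(observed yes) - 0.5."""
    observed_yes = sum(responses) / len(responses)
    return 2 * observed_yes - 0.5

# Simulate 10,000 respondents, 30% of whom would truly answer "yes".
random.seed(42)
truth = [random.random() < 0.3 for _ in range(10_000)]
responses = [randomized_response(t) for t in truth]
print(f"Estimated true fraction: {estimate_true_fraction(responses):.3f}")
```

The privacy comes from plausible deniability: any individual “yes” could have been produced by the random branch, so Bayes’ theorem only modestly updates our belief about that person’s true answer.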
Differential privacy appears to be emerging as the go-to solution for anonymizing data in a way that protects respondents’ privacy while still allowing useful aggregate statistics to be extracted. Even the U.S. Census Bureau will be transitioning to differential privacy for the 2020 census. If you’re interested in such things, the Census Bureau has a paper describing some of the details.
Last Minute Addition:
Just as I was getting ready to publish this, I came across this talk by John Abowd of the Census Bureau about their plans for the use of differential privacy. The talk is an hour long, so here are a couple of interesting points if you don’t have the time or interest to watch it. Abowd says that noise infusion is an absolute necessity to protect the data. As an example of why this is true, and of how easy it is to identify respondents from limited data, the bureau combined commercial databases with just three items from the 2010 census data: age, sex, and census block (a relatively small area that the bureau uses to group respondents by location). That was enough to produce a unique putative identification for 48% of the population. When those results were checked against the actual data, the inference turned out to be correct for 38% of the population.
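To make “noise infusion” concrete, here’s a minimal sketch of the Laplace mechanism, the textbook technique for adding calibrated noise to a published count. The query and the epsilon value are illustrative assumptions on my part, not the bureau’s actual design.

```python
import random

def noisy_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace(scale = 1/epsilon) noise.

    A counting query changes by at most 1 when one person's record is
    added or removed (sensitivity 1), so Laplace noise with scale
    1/epsilon yields epsilon-differential privacy for the count.
    """
    # The difference of two iid Exponential(rate = epsilon) draws is
    # Laplace-distributed with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

random.seed(1)
# A hypothetical block-level count of 1234, published with noise of
# typical magnitude 1/epsilon = 2.
print(noisy_count(1234, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the published total for a whole state barely moves, while the kind of block-level linkage attack Abowd describes becomes much less reliable.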