An Application Of Bloom Filters At Google

Joe Marshall, who is an old time Lisper now working at Google, has an interesting story about the use of Bloom filters at Google. The story begins when Marshall noticed an anomaly in the response time graph of one their search appliances. There was an odd blip at 300 ms across several machines that was independent of the search string.

Marshall soon narrowed the problem to the spell server that asks you “Did you mean X” when you misspell a word in your search string. Here’s how that works. When you enter a search string, each word is checked to see if it’s in the list of “known misspellings of legitimate terms”. This is done by means of a Bloom filter. The thing about Bloom filters is that if they tell you an item is not in the list then that item is definitely not in the list. But if it says the item is in the list, the item may or may not be there. If a spell server term is flagged as a misspelling, a longer running process is invoked to find the misspelling and suggest alternatives. Obviously, misflagged terms increase response time.

A good Bloom filter is designed so the probability of a false positive is minimized. Marshall gives the basic idea of Bloom filters but they’re actually a bit more complex because they use multiple hash functions. Still, his explanation tells you what you need to know: It basically hashes each word into a bit array that is false if the term is not there, and true if any term hashed to that bit location. Obviously the bigger the bit array, the less chance of a collision. You can check the linked Wikipedia article if you want the details.

Marshall’s problem turned out to be that the bit array wasn’t big enough so even correctly spelled words were being flagged. When he increased the bit array size the problem went away and the 300 ms blip disappeared.

This entry was posted in Programming. Bookmark the permalink.