As long time readers know, I am a big fan of Russ Cox’s work. In particular, his 3 part series on implementing regular expression search is just outstanding. Now Cox has added a fourth article to the series that describes how he implemented Google Code Search.
There is, of course, too much code on the Web to do a grep over the entire corpus. Instead, they built a trigram index of the documents and then analyzed each regexp to decide which trigrams would need to be present in a document to satisfy the regexp. Then they used the trigram index to retrieve and grep just those documents. Cox’s paper discusses how they did the analysis—it more complicated than you might imagine—and presents some results on a smaller version for which the code is available. (Google discontinued Code Search so the standalone version is all there is).
One example is searching for “hello world” in the 3.1.3 Linux kernel. There are 36,972 files to search but only 25 of them contain the necessary trigrams: “ wo”, “ell”, “hel”, “llo”, “lo ”, “o w”, “orl”, “rld”, “wor”. That cuts the search time by 2 orders of magnitude.
This paper is a beautiful example of using some elementary data structures and some smart analysis to bring a huge problem down to a manageable size. I really recommend the article for anyone interested in programming.