Checking A Word List

Daniel Lemire has a post that poses an interesting challenge. Suppose you have a list of words—Lemire uses “ftp”, “file”, “http”, “https”, “ws”, “wss”—and want to check if some other word is in the list.

Unless you believe in Zawinski’s law, you might choose regular expressions, especially if you’re writing in Elisp. But as Lemire points out, that’s going to be pretty slow. My first thought was to use regexp-opt to generate an optimized regular expression. For the list above, regexp-opt returns the regex "\\(?:f\\(?:ile\\|tp\\)\\|https?\\|wss?\\)", which will be faster than the obvious regexp but probably still not that fast if the check is going to be made many times.

Lemire offers some other solutions—in C++—including the obvious one of just sequentially checking the list. For a longer list, that could be sped up with a binary search but probably wouldn’t help much with the short list we have. He also offers some other solutions based on padding the input string. He has some benchmark results if you’re interested in the relative times.

One solution that he doesn’t mention is the one I would think of first: check inclusion with some sort of hashing operation. The obvious solution of that sort is to just put the list words in a hash table and check the input word by looking it up. That’s going to be very fast, especially if the list is long.

Perfect hashing seems like another good solution for a short list and will be fast to eliminate a candidate word but if you get a hit, you still have to check for a match. In any event, since the word list is known beforehand, you can adjust whatever hashing strategy you use to be as fast as possible. String comparisons are probably going to be the bottle neck is any reasonable solution so it’s important that the solution minimize those as much as possible.

It’s a nice problem and well worth thinking about.

This entry was posted in Programming and tagged . Bookmark the permalink.