Regular Expressions in 30 Lines of C

I recently saw a pointer to a nice article by Brian Kernighan that describes some beautiful code written by Rob Pike for their book The Practice of Programming. As an example of the power of notation, they wanted to show a simple implementation of regular expression matching. They couldn’t use one of the existing implementations because they were too long so Pike sat down for an hour or two and whipped out a 30-line version in C.

In the article, A Regular Expression Matcher, Kernighan explains why he thinks Pike’s creation is “beautiful code.” It’s compact, clear, elegant and implements a simple example of a real tool. If you know C or can read it, you should definitely take a look at Kernighan’s article.

If, after reading the article, you’d like to know a bit more about how regular expressions are implemented, you should take a look at Russ Cox’s posts on the subject. The first two posts, in particular discuss some of the theory and show how to implement efficient algorithms. Efficiency can be important in certain edge cases where the implementations in Perl and many other popular languages can take exponential time. Compare that with Ken Thompson’s original implementation for the QED editor from the 60s. Thompson’s version is literally a million times faster than Perl’s for one pathological regular expression and even faster for others. You can read all about that and why Thompson’s version performs so much better in Cox’s first post. The second post has a virtual machine implementation that more or less duplicates Thompson’s QED algorithm that generated native code on the fly for the matcher.

This entry was posted in General and tagged , . Bookmark the permalink.