Chris Wellons on Hashing

The first data structure I learned and really internalized was the hash table. I’ve been using them my entire career and, given when I started, that meant implementing them myself. This was before I learned C and even C doesn’t have a built-in hash table so it’s always been a matter of build it or do without.

These days, it’s hard to find a (newer) language that doesn’t have a built-in hash table or dictionary data type. There’s a downside to that, however. For many younger engineers, hash tables are mysterious entities that seem like magic. There are, in fact, many strategies for implementing a hash table each with their own tradeoff and which is best depends on the particular application. The problem with a built-in type is that it’s one size fits all.

Chris Wellons writes mostly in C so he has implemented scores of hash tables. He’s settled on his own implementation method which is basically open addressing with double hashing. You can check his post for what exactly that means but it’s one of the main implementation methods. I’ve always preferred the chaining method, which is usually faster but takes more storage. Like Wellons, I’ve done scores of implementations.

Why do that? After all, most languages have a built-in type and even C probably has a library that implements hash tables. Lots of people think there IS no point but I disagree: if you know how things work, you know how to pick the best implementation strategy for your particular application.

Take a look at Wellons’ post for some of those details. Section 6.4 of Volume 3 of Knuth’s AOCP has all the details on the various methods and their tradeoffs. If you’re serious about software engineering, it will repay your attention.

This entry was posted in General and tagged . Bookmark the permalink.