Implementing Strings

Andreas Zwinkau has a very nice article on strings and how to implement them. If you’re a C programmer, your first reaction might be, “Meh. They’re an array of bytes. What’s to implement?” If you’re, say, a Python programmer, you may not have thought about the question because, after all, they’re a built-in primitive type.

But there are interesting questions to be asked, even for the C programmer. As a single example, it used to be that strings were made up of a series of characters, each a single byte. In these days of Unicode, that’s not true even for the most parochial of middle American programmer with no intention of having users outside his town, let alone his country.

The first interesting question is string length. A simple byte count is no longer adequate when a character can be 1, 2, 3, or 4 bytes as they can in UTF-8. On the one hand, you need a byte count for low level operations such as allocating space for the string or copying one string to another. But very often, it’s the character count that the application programmer is interested in so you need to record both or have any easy way of calculating them.

Some programming language purists have railed for years about the traditional C method of marking the end of a string with a NULL byte and calculating string length by simply scanning down the string looking for the NULL and counting the bytes as you go. Yes, it’s computationally expensive and yes, it can be a source of subtle bugs but it never bothered me or—I’d wager—any competent C programmer. Regardless, it’s clearly not adequate in a Unicode world so other implementations are needed.

A large part of Zwinkau’s post is concerned with this question but there are other interesting questions as well. If you’re a programmer, you should definitely take a look at the article. It doesn’t reveal any earth shattering, new, magical methods but it does show how to solve some of the problems involving strings that few of us have to worry about.

This entry was posted in Programming and tagged . Bookmark the permalink.