From the excellent TeX Tip Twitter feed, I found a pointer to a fine article by Alessandro Segala concerning something I didn’t know much about and that many Irreal readers might not know about either: how and why to normalize Unicode. If you’re like me, you haven’t even heard of normalizing Unicode before.
The problem stems from the fact that many Unicode characters have more than one representation. The example that Segala gives is the character ë. It can be represented either as a single Unicode code point or as the composition of two: the lowercase ‘e’ and the combining diaeresis (umlaut), ‘¨’. In both cases the displayed result is the same, ë, but the internal representation is different.
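To make that concrete, here’s a quick JavaScript sketch; the escapes and variable names are my own illustration, not Segala’s:

```javascript
// The same glyph, ë, built two different ways.
const single = "\u00EB"; // one code point: U+00EB LATIN SMALL LETTER E WITH DIAERESIS
const pair = "e\u0308";  // two code points: 'e' + U+0308 COMBINING DIAERESIS

console.log(single, pair);               // ë ë  (they render identically)
console.log(single === pair);            // false (the underlying code points differ)
console.log(single.length, pair.length); // 1 2
```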
This matters a lot when you’re checking for equality, as in a search, say. If someone enters their name as Zoë, you may not find it in your database if it’s stored in a different representation. Unicode normalization is a way of solving this problem. The idea is to convert all representations of ë to a single one. According to Segala, there are four standard normalization forms (NFC, NFD, NFKC, and NFKD) and you can use whichever is more convenient, but the goal is to have all characters use the same one.
Happily, standard libraries provide functions for this. In JavaScript, for example, you can use the built-in String.prototype.normalize() method.
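Here’s a minimal sketch of how that looks; the Zoë example is my own, not taken from Segala’s post:

```javascript
// Two renderings of "Zoë" that look identical but compare unequal.
const composed = "Zo\u00EB";    // precomposed ë (U+00EB)
const decomposed = "Zoe\u0308"; // 'e' followed by U+0308 COMBINING DIAERESIS

console.log(composed === decomposed); // false

// normalize() defaults to "NFC"; any of "NFC", "NFD", "NFKC", "NFKD"
// works, as long as both sides are converted to the same form.
console.log(composed.normalize() === decomposed.normalize());           // true
console.log(composed.normalize("NFD") === decomposed.normalize("NFD")); // true
```

Take a look at Segala’s post for the details and more information.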