Sort and Uniq

I’ve been using Unix for three decades, but I just learned something new about two of the basic Unix tools: sort and uniq. If you’re not familiar with Unix command line tools, sort does just what it sounds like. You can specify fields of interest and sort will sort the file based on those fields. The uniq filter takes line-sorted input and eliminates duplicate lines.
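To see why uniq wants sorted input, here is a minimal sketch. The four-line sample input is made up for illustration. Because uniq only collapses adjacent duplicates, running it on unsorted data leaves repeats behind:

```shell
# Hypothetical sample input: "b" appears three times, but not all adjacent.
input=$(printf 'b\na\nb\nb\n')

# uniq alone only merges the two adjacent "b" lines at the end.
unsorted=$(printf '%s\n' "$input" | uniq)

# Sorting first brings all the "b" lines together, so uniq can merge them.
sorted=$(printf '%s\n' "$input" | sort | uniq)

printf 'uniq alone:\n%s\n' "$unsorted"
printf 'sort | uniq:\n%s\n' "$sorted"
```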

The sort command has a -u option that eliminates duplicate lines after sorting. For years I, along with many other people, apparently, thought of the two commands

sort -u ...

and

sort ... | uniq

as being equivalent. That turns out not to be true.

One of the things old Unix-heads like to complain about is the proliferation of option flags to the basic commands. Unix orthodoxy decrees that each tool should do just one job and do it well and that additional functionality should be implemented as a filter in a pipeline. Thus the proper way of eliminating duplicate lines in a file is

sort file | uniq

In a recent discussion on The Unix Heritage Society mailing list on the topic of extraneous command options, Doug McIlroy explains that there’s a good reason for the seemingly redundant -u option to sort and that there’s a subtle difference between sort -u ... and sort ... | uniq. Follow the link to see what that difference is and why sort’s -u option makes sense.
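For the impatient, here is a rough sketch of one well-known way the two can diverge, which I assume is the kind of difference meant here: when sort is given a key, -u treats lines with equal keys as duplicates, while uniq compares whole lines. The sample data and variable names are made up for illustration.

```shell
# Hypothetical sample data: the first field repeats, the second does not.
data=$(printf 'apple 1\napple 2\nbanana 1\n')

# sort -u with a key keeps one line per distinct key (field 1 here).
by_key=$(printf '%s\n' "$data" | sort -u -k1,1 | wc -l)

# uniq compares whole lines, so "apple 1" and "apple 2" both survive.
by_line=$(printf '%s\n' "$data" | sort -k1,1 | uniq | wc -l)

echo "sort -u -k1,1 keeps:     $by_key lines"
echo "sort -k1,1 | uniq keeps: $by_line lines"
```

Which of the two apple lines survives sort -u is left to the implementation, so the sketch only counts lines rather than printing them.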
