I’ve been using Unix for three decades, but I just learned something new about two of the basic Unix tools, sort and uniq. If you’re not familiar with Unix command line tools, sort does just what it sounds like. You can specify fields of interest and sort will sort the file based on those fields. The uniq filter takes a line-sorted input and eliminates duplicate lines.
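For anyone who hasn’t used the two tools, here is a minimal sketch of both behaviors. The sample data and the variable names (data, by_count, unsorted, deduped) are made up for illustration; only the sort and uniq invocations matter.

```shell
# Hypothetical sample data: a name and a count, separated by a space.
data='pear 3
apple 10
pear 3
fig 2'

# Sort numerically on the second field only (-k2,2 picks the field, n makes it numeric).
by_count=$(printf '%s\n' "$data" | sort -k2,2n)
printf '%s\n' "$by_count"

# uniq only collapses adjacent duplicates, so on unsorted input
# the two "pear 3" lines both survive.
unsorted=$(printf '%s\n' "$data" | uniq)

# Sorting first brings the duplicates together so uniq can drop one.
deduped=$(printf '%s\n' "$data" | sort | uniq)
printf '%s\n' "$deduped"
```

The unsorted case is the reason uniq is described as wanting line-sorted input: it compares each line only with the one before it.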
The sort command has a -u option that eliminates duplicate lines after sorting. For years, I, along with many other people, apparently, thought of the two commands
sort -u ...
and
sort ... | uniq
as equivalent. That turns out not to be true.
One of the things old Unix-heads like to complain about is the proliferation of option flags to the basic commands. Unix orthodoxy decrees that each tool should do just one job and do it well, and that additional functionality should be implemented as a filter in a pipeline. Thus, the proper way of eliminating duplicate lines in a file is
sort file | uniq
In a recent discussion on The Unix Heritage Society mailing list on the topic of extraneous command options, Doug McIlroy explains that there’s a good reason for the seemingly redundant -u option to sort and that there’s a subtle difference between sort -u ... and sort ... | uniq. Follow the link to see what that difference is and why sort’s -u option makes sense.
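One difference that is easy to demonstrate, and which may or may not be the whole of McIlroy’s point, shows up when sort is given a key. With a key, sort -u treats lines with equal keys as duplicates, while uniq still compares whole lines. The sketch below uses made-up sample data and hypothetical variable names (data, by_key, by_line).

```shell
# Hypothetical sample data: two lines share the key "apple" but differ in field 2.
data='apple 1
apple 2
banana 3'

# sort -u with a key on field 1: lines with equal keys count as duplicates,
# so only one "apple" line survives. Which one is kept is not something to rely on.
by_key=$(printf '%s\n' "$data" | sort -u -k1,1)
printf '%s\n' "$by_key"

# Same key, but deduplicating with uniq: whole lines are compared,
# so both "apple" lines survive.
by_line=$(printf '%s\n' "$data" | sort -k1,1 | uniq)
printf '%s\n' "$by_line"
```

With no key at all, the two pipelines keep the same set of lines, which is probably why so many of us assumed they were interchangeable.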