Over at his blog, Ted Dziuba has posted a valuable piece of wisdom that is easy to forget. The TL;DR is that many tasks for which we instinctively turn to heavyweight, distributed tools are more easily and safely accomplished with a pipeline of standard Unix utilities.
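To give the flavor of what he means, here’s a minimal sketch, using a hypothetical access.log and field layout rather than anything from Dziuba’s post: finding the ten most frequent client addresses in a web server log, the sort of job that sometimes tempts people into standing up a cluster.

    # Hypothetical access.log in common log format: the client address is field 1.
    awk '{print $1}' access.log |   # pull out the client address
        sort |                      # group identical addresses together
        uniq -c |                   # count how many times each appears
        sort -rn |                  # order by count, largest first
        head -10                    # keep the top ten

Each stage streams its input, and GNU sort spills to temporary files when needed, so a pipeline like this handles logs far larger than memory without any special machinery.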
That can be a hard lesson to learn. When I was young and foolish, I would often throw together such a solution but, thinking of it as merely a proof of concept, go on to build a “real” solution, usually in C. I did that even though the original solution solved the problem perfectly well and had a perceived run time virtually indistinguishable from that of the C version.
Doubtless, I’m still doing silly things, but at least I’ve gotten over that particular brand of silliness. Not all jobs are amenable to a Unix pipeline, of course, but many are, and it’s worth looking at each task to see whether it qualifies before renting a bunch of virtual servers and firing up Hadoop.
Back in the Old Days™, resources were more limited, so solutions like those Dziuba recommends were more natural and pipelines were the go-to tool. That’s less true today, when resources are plentiful and the command line is often regarded the way we regard COBOL. Still, even though it’s an old post, it’s worth reading and taking to heart.