Exploratory Data Analysis With AWK

This post is about Brian Kernighan and AWK, two things always worth hearing about. All of you, I’m sure, know that after retiring from Bell Labs, Kernighan took a professorship at his alma mater, Princeton. Since he’s been there, Kernighan has often offered computer science courses aimed at students in the humanities, such as Computers in our World.

Last spring he taught a course in the Humanities department, Literature as Data, that considered how computational techniques can answer questions in literary research. The usual vehicle for this sort of thing is Python and the Pandas library, but Kernighan and his co-instructor decided to emphasize AWK and how it can be used in exploratory data analysis.

After the class was over, Kernighan wrote a short essay describing the class and their experience with it. He gives examples of many of the questions they explored by running AWK against a database of sonnets and their authors. The database is in CSV format and perfect for exploring with AWK. Some of the results were surprising. For example, although sonnets are supposed to be exactly 14 lines long, some, even some by Shakespeare, have more or fewer lines. One even had 123 lines. Who were the poets who broke the rules? AWK makes it easy to explore the question.
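
To give a flavor of that kind of query, here is a minimal sketch. It is not taken from Kernighan's essay, and I'm assuming a made-up layout rather than the real database's: a hypothetical file, sonnets_sample.csv, with columns author, title, and line count, and no quoted commas, so plain -F, is enough. The rows are placeholders, not real data.

```shell
# Hypothetical sample file; the column layout and rows are assumptions for
# illustration and do not come from the actual sonnet database.
cat > sonnets_sample.csv <<'EOF'
author,title,lines
Poet A,Sonnet 1,14
Poet A,Sonnet 2,15
Poet A,Sonnet 3,12
Poet B,Sonnet 1,14
EOF

# Skip the header row (NR > 1) and list every sonnet that is not 14 lines long.
awk -F, 'NR > 1 && $3 != 14 { print $1 ": " $2 " (" $3 " lines)" }' sonnets_sample.csv

# Count the rule-breaking sonnets per author.
awk -F, 'NR > 1 && $3 != 14 { n[$1]++ } END { for (a in n) print a, n[a] }' sonnets_sample.csv
```

A real CSV file with quoted fields that contain commas would need more careful parsing than -F, provides; recent AWK versions offer a --csv option for that, but check that your AWK supports it before relying on it.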

Kernighan’s essay has more examples of questions they were able to explore by querying the database with AWK. The essay links to the database they used so you can play with it yourself if you’re so inclined. It’s an interesting essay and a short read, so it’s well worth spending a couple of minutes on.
