Mawking AWK with Lisp

Back in 2009, Brendan O'Connor over at AI and Social Science posted an article entitled Don't MAWK AWK—the fastest and most elegant big data munging language! He recently posted an update that caused the original article to pop up on my radar. The problem that the post addressed was transforming about a gigabyte of data into a form that could be used by Matlab and R. The transformation itself was simple and wouldn't normally be something that Irreal readers would find interesting—it took just 3 lines of AWK code—except for a surprising result.

Here's the surprise: since the AWK script was so simple, O'Connor and a colleague wrote versions of the program in AWK, Java, Python, Perl, Ruby, and two versions in C++. Altogether they ran 9 tests and the fastest version was in AWK. It turns out that the mawk version of AWK outperformed C++, Java, Python, Perl, Ruby, and its two siblings, nawk and gawk. That's pretty surprising. Of course, the versions in the other languages were all longer than the AWK version too, in some cases by an order of magnitude. See O'Connor's post for the timings and details.

Naturally, I had to try it in Lisp to see how it would compare in speed and size. To put the Lisp numbers in context, I reran the tests for every language I have installed. That leaves out Java and Ruby (but see the update below), which still gives a good idea of how Lisp compares. All the tests were run on my iMac. Here are the system specifications:

[Image: system specifications of the iMac]

Language    Version           Time (min:sec)   Lines of Code
mawk        1.3.4             0:44.634          3
C-ish C++   4.2.1 llvm-g++    0:53.255         42
Python      2.7.1             1:21.003         20
Lisp-opt    SBCL 1.1.1        1:54.576         19
Perl        5.12.3            1:57.928         17
Lisp        SBCL 1.1.1        1:58.193         16
C++         4.2.1 llvm-g++    2:49.368         48
awk         20070501          3:18.187          3
Ruby        1.8.7             4:45.760         22

Here's the Lisp code (see O'Connor's post for the other code):

;; Load Quicklisp so the libraries below can be found
(load #P"/Users/jcs/quicklisp/setup.lisp")
(require ':jcs-utils)
(require ':split-sequence)
;; jmap assigns feature ids and is shared across all input files
(let ((j 0) (jmap (make-hash-table :test 'equal)))
  (with-open-file (v "vocab" :direction :output)
    (dolist (f (cdr sb-ext:*posix-argv*))
      ;; imap assigns item ids and starts fresh for each input file
      (let ((i 0) (imap (make-hash-table :test 'equal)))
        ;; Output goes to a file named after the input with "n" appended
        (with-open-file (fs (concatenate 'string f "n") :direction :output)
          (jcs-utils:dolines (l f)
            (destructuring-bind (item feature value) (split-sequence:split-sequence #\Space l)
              (unless (gethash item imap)
                (setf (gethash item imap) (incf i)))
              ;; Each new feature is also recorded in the vocab file
              (unless (gethash feature jmap)
                (setf (gethash feature jmap) (incf j))
                (format v "~a~%" feature))
              (format fs "~a ~a ~a~%" (gethash item imap) (gethash feature jmap) value))))))))
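
The dolines macro comes from the jcs-utils library and its source isn't shown here. Judging from how it's called, it presumably binds a variable to each line of the named file in turn. For anyone who wants to try the code without that library, a stand-in might look something like the sketch below. The name my-dolines and its exact behavior are assumptions, not the actual library code.

```lisp
;; Hypothetical stand-in for jcs-utils:dolines. The real macro is not shown
;; in the post, so both the name and the behavior here are assumptions.
(defmacro my-dolines ((var file) &body body)
  "Evaluate BODY once per line of FILE, with VAR bound to the line's text."
  (let ((stream (gensym "STREAM")))
    `(with-open-file (,stream ,file :direction :input)
       ;; read-line returns NIL at end of file, which ends the loop
       (loop for ,var = (read-line ,stream nil nil)
             while ,var
             do (progn ,@body)))))
```

With this in place, (my-dolines (l f) ...) could be substituted for (jcs-utils:dolines (l f) ...) in the code above.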

The Lisp-opt run used the same code except that I added some optimization declarations and specified an initial size for the hash tables. As you can see, it didn't make much difference: about 3.6 seconds, or roughly 3%. I'm a bit surprised that the Python version is faster than the Lisp one, but the most salient result remains that mawk outperformed everything else, even C++.
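
The exact declarations and table sizes used for the Lisp-opt run aren't listed here. As a rough illustration of the kind of change involved, here is a sketch of a global optimization declaration and a pre-sized hash table. The optimization qualities, the size of 100000, and the helper names make-id-table and intern-id are all assumptions made for the example.

```lisp
;; Assumed optimization settings; the post does not list the real ones.
(declaim (optimize (speed 3) (safety 1) (debug 0)))

;; Pre-sizing a hash table reduces rehashing as it grows.
;; The default of 100000 is an illustrative guess, not a value from the post.
(defun make-id-table (&optional (size 100000))
  (make-hash-table :test 'equal :size size))

(defun intern-id (key table)
  "Return the id for KEY in TABLE, assigning the next sequential id if new."
  (declare (type hash-table table))
  ;; Entries are never removed, so the current count plus one is the next id.
  (or (gethash key table)
      (setf (gethash key table) (1+ (hash-table-count table)))))
```

Here intern-id folds the "unless present, assign the next number" step into one function. It relies on the table's entry count rather than a separate counter, which is a different bookkeeping choice from the i and j variables in the code above.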

Update:

I didn't realize that Ruby is installed by default on OS X. For completeness, I've added Ruby to the results.
