Finding The Intersection of Two Lists

I recently saw an Emacs Stack Exchange Question on how to find the intersection of two lists. The questioner asked if there was a better way than the obvious solution of looping over both lists looking for matches, a \(O(mn)\) solution for lists of size \(m\) and \(n\). The “best” solution was to use the Elisp function seq-intersection, which accomplishes the task in one function call but that under the covers loops over both loops giving it \(O(nm)\) performance

My first thought when I saw the question was to use a hash table to capture the items in the first list and then check the second list against the hash. That would be a \(O(n+m)\) solution but, of course, the multiplicative constants would be much bigger for \(O(n+m)\) than for \(O(nm)\). My gut feeling was that the hash table solution would be a bit faster but as my experiments with palindrome predicates showed, my gut is not particularly reliable.

The only answer was to run some benchmarks. I wrote a bit a Elisp to generate a couple of random lists and a function to find their intersection using the hash table method. The hash-intersection function is pretty simple. It adds each entry of the first list to a hash table with a value of t. Then it checks each entry in the second list to see if it’s in the hash table.

(defun make-random-list (n)
  (let ((l nil))
    (dotimes (i n l)
      (push (random 1500) l))))

(defconst lst1 (make-random-list 1000))
(defconst lst2 (make-random-list 1000))

(defun hash-intersection (l1 l2)
  (let ((ht (make-hash-table :test #'equal))
        (acc nil))
    (mapc (lambda (x) (puthash x t ht)) l1)
    (mapc (lambda (x) (if (gethash x ht nil)
                          (push x acc)))
          l2)
    acc))

Then I ran benchmarks that called hash-intersection and seq-intersection 100 times on the two lists. The benchmark-run macro reports total running time, the number of garbage collections, and the time taken by garbage collection.

(benchmark-run 100 (hash-intersection lst1 lst2))
0.081114 0 0.0
(benchmark-run 100 (seq-intersection lst1 lst2))
8.2707 0 0.0

As you can see, there was no garbage collection. Surprisingly, the hash-intersection function was about 100 times faster, which is much faster than I expected. I thought there might be garbage collection with the hash table method but there wasn’t—perhaps there would be with larger lists.

Regardless of the garbage collection results, the hash table method certainly takes more memory. Even so, the speed results suggest that for most reasonable inputs, the hash table method is a good choice. With shorts lists of 8 elements, both methods took about a microsecond for a single run so there doesn’t appear to be a penalty for short lists either.

This entry was posted in General and tagged , . Bookmark the permalink.