Finding Intersections Redux

Last month, I published a post on Finding The Intersection of Two Lists. The impetus for the post was an Emacs Stack Exchange question asking if there was a better way to find the intersection of two lists than looping over both of them. I thought putting the entries of the first list in a hash table and checking the entries of the second list against it was a better solution and I wrote some code to check out that hypothesis. You can see the details in the original post.

I got some gentle pushback from two Elisp luminaries, Chris Wellons and Alphapapa. Wellons noted that I could get a small speed increase by using dolist instead of mapc. More importantly, he noted that I should have compiled my code with lexical mode as that makes a large difference in running time.

Alphapapa, commenting as NoonianAtall, noted that I wasn’t eliminating duplicates and pointed me at some macros that make benchmarking this type of thing easier. He published a gist that showed one of those macros in action.

Naturally, I had to check all this out. I wrote a new version of hash-intersection that used dolist instead of mapc and also made a small change (hash-intersection-dedup) to the algorithm so that duplicates were eliminated. Finally, I stole Alphapapa’s version of the code that uses the loop macro instead dolist.

Here’s all the code, including the bit to generate random lists, for reference:

(require 'epdh)
(require 'dash-functional)
(defun make-random-list (n)
  (let ((l nil))
    (dotimes (i n l)
      (push (random 1500) l))))

(defconst lst1 (make-random-list 1000))
(defconst lst2 (make-random-list 1000))

(defun hash-intersection-orig (l1 l2)
  (let ((ht (make-hash-table :test #'equal))
        (acc nil))
    (mapc (lambda (x) (puthash x t ht)) l1)
    (mapc (lambda (x) (if (gethash x ht nil)
                          (push x acc)))
          l2)
    acc))

(defun hash-intersection-dolist (l1 l2)
  (let ((ht (make-hash-table :test #'equal))
        (acc nil))
    (dolist (l l1)
      (puthash l t ht))
    (dolist (l l2)
      (if (gethash l ht nil)
          (push l acc)))
    acc))

(defun hash-intersection-dedup (l1 l2)
  (let ((ht (make-hash-table :test #'equal))
        (acc nil))
    (dolist (l l1)
      (puthash l t ht))
    (dolist (l l2)
      (when (gethash l ht nil)
        (puthash l nil ht)              ;only one to a customer
        (push l acc)))
    acc))

;; From Alphapapa
(defun hash-intersection-loop (l1 l2)
  (let ((ht (make-hash-table :test #'equal) ))
    (cl-loop for e1 in l1
             do (puthash e1 t ht))
    (cl-loop for e2 in l2
             when (gethash e2 ht)
             collect it)))

Now we’re ready to test Wellons’ claims. First, I used the bench-dynamic-vs-lexical-binding macro to measure the difference that compiling with lexical bindings made:

(bench-dynamic-vs-lexical-binding
  :times 100
  :forms (("hash-intersection-orig"   (progn
                                        (defun hash-intersection-orig (l1 l2)
                                          (let ((ht (make-hash-table :test #'equal))
                                                (acc nil))
                                            (mapc (lambda (x) (puthash x t ht)) l1)
                                            (mapc (lambda (x) (if (gethash x ht nil)
                                                                  (push x acc)))
                                                  l2)
                                            acc))
                                        (hash-intersection-orig lst1 lst2)))))
Form x faster than next Total runtime # of GCs Total GC runtime
Lexical: hash-intersection-orig 1.25 0.027208 0 0
Dynamic: hash-intersection-orig slowest 0.033979 0 0

Just as Wellons promised, the lexical version is 25% faster. Next, I compared all the implementations (as well as seq-difference) compiled with lexical bindings. There were a couple of surprises. First, the deduping version ran slightly faster than the dolist version. That was a pretty consistent result across several runs of the comparison. The only reason for it that I can see is that eliminates the push for all the duplicates.

A bit more surprising is that the dolist version ran 28% faster than the mapc version. That seems pretty high but the numbers did jump around a bit between different runs so perhaps it’s just an anomaly. If you want to play around with Alphapapa’s macros (they’re here) that would be a good experiment to run. There are more macros than I’ve mentioned here so they should meet whatever your benchmarking needs are.

(bench-multi-lexical
  :times 100
  :forms (("hash-intersection-orig"   (progn
                                        (defun hash-intersection-orig (l1 l2)
                                          (let ((ht (make-hash-table :test #'equal))
                                                (acc nil))
                                            (mapc (lambda (x) (puthash x t ht)) l1)
                                            (mapc (lambda (x) (if (gethash x ht nil)
                                                                  (push x acc)))
                                                  l2)
                                            acc))
                                        (hash-intersection-orig lst1 lst2)))
          ("hash-intersection-dolist"   (progn
                                          (defun hash-intersection-dolist (l1 l2)
                                            (let ((ht (make-hash-table :test #'equal))
                                                  (acc nil))
                                              (dolist (l l1)
                                                (puthash l t ht))
                                              (dolist (l l2)
                                                (if (gethash l ht nil)
                                                    (push l acc)))
                                              acc))
                                          (hash-intersection-dolist lst1 lst2)))
          ("seq-intersection" (seq-intersection lst1 lst2))
          ("hash-intersection-loop"   (progn
                                        (defun hash-intersection-loop (l1 l2)
                                          (let ((ht (make-hash-table :test #'equal) ))
                                            (cl-loop for e1 in l1
                                                     do (puthash e1 t ht))
                                            (cl-loop for e2 in l2
                                                     when (gethash e2 ht)
                                                     collect it)))
                                        (hash-intersection-loop lst1 lst2)))
          ("hash-intersection-dedup"   (progn
                                         (defun hash-intersection-dedup (l1 l2)
                                           (let ((ht (make-hash-table :test #'equal))
                                                 (acc nil))
                                             (dolist (l l1)
                                               (puthash l t ht))
                                             (dolist (l l2)
                                               (when (gethash l ht nil)
                                                 (puthash l nil ht) ;only one to a customer
                                                 (push l acc)))
                                             acc))
                                         (hash-intersection-dedup lst1 lst2)))))
Form x faster than next Total runtime # of GCs Total GC runtime
hash-intersection-dedup 1.02 0.021943 0 0
hash-intersection-loop 1.09 0.022341 0 0
hash-intersection-dolist 1.28 0.024367 0 0
hash-intersection-orig 315.94 0.031197 0 0
seq-intersection slowest 9.856230 0 0

Finally, I should mention that the macros are part of Alphapapa’s Emacs Package Developer’s Handbook, which has a wealth of useful information—even if you aren’t writing packages—and is definitely worth taking a look at if you haven’t already.

Update [2020-02-08 Sat 11:22]: NoonianAtall (Alphapapa) says that I should have included cl-intersection and -intersection in my comparisons. Fortunately, he’s already done that in the gist that I mentioned so you can check them out there. Cl-intersection is, of course, included with Emacs and -intersection is from Dash which you probably also have since it’s a dependency of many popular packages.

This entry was posted in General and tagged , . Bookmark the permalink.