# The Social Network of a Word

I ran across an interesting interview question the other day: Given a base word and a list of other words, use the Levenshtein distance to find the social network of the base word in the list of other words. More specifically, we call wordb a friend of worda if the Levenshtein distance between them is 1. The problem is to start with the base word, find all its friends in the other list, then find all the friends of the friends, and so on. This problem is representative of a large class of related problems (such as The Six Degrees of Kevin Bacon) so it's worthwhile considering how to solve it.

The Wikipedia article on Levenshtein distance has a straightforward pseudocode implementation of an algorithm to find the Levenshtein distance between two words that I translated directly into Lisp. I won't bother showing the code since it's pretty much a one-to-one translation.

The basic strategy is to do a breadth-first search of the base word's social graph. We don't have the graph, of course, that's what we want to find. Instead, we build it up piece by piece. The main tool for doing breadth-first searches is a queue or FIFO. I've disussed these and given an implementation before. The implementation at the link is in Scheme but it's trivial to convert it to Common Lisp. Again, I won't bother showing the Lisp implementation.

With those two pieces in place, it's pretty easy to implement the algorithm to find the social network.

``` 1:  (defun social (word words &optional graph)
2:    "Find the social graph for word in the population words"
3:    (let ((inside (make-instance 'queue))
4:          (outside (make-instance 'queue :initial-contents words))
5:          (new-outside (make-instance 'queue))
6:          (results (list (cons word 0)))
7:          nodes)
8:      (pushq (cons word  0) inside)
9:      (do ((datum (popq inside) (popq inside)))
10:          ((null datum) (if graph (sgraph (reverse nodes)) (reverse results)))
11:        (destructuring-bind (word . distance) datum
12:          (do ((potential-friend (popq outside) (popq outside)))
13:              ((null potential-friend))
14:            (if (= (levenshtein word potential-friend) 1)
15:                (let ((friend (cons potential-friend (1+ distance))))
16:                  (pushq friend inside)
17:                  (push friend results)
18:                  (push (cons (car datum) potential-friend) nodes))
19:                (pushq potential-friend new-outside))))
20:        (rotatef new-outside outside))))
```

The implementation uses 3 queues:

• The `outside` queue that holds words that have not yet been identified as friends. It's initial value is the list of other words.
• The `inside` queue that holds friends that are waiting to be processed. It is initialized to the base word with a distance (from the base word) of zero at line 8.
• The `new-outside` queue that holds words from the `outside` queue that aren't friends of head of the `inside` queue.

The main loop pops off the head of the `inside` queue at line 9 and checks each word on the `outside` queue to see if it is a friend. If it is, it adds that word and its distance from the base word to the end of the `inside` queue (line 16), it adds the same to the `results` list (line 17), and finally adds the two words being compared onto the `nodes` list (line 18). If the word being checked is not a friend, it is pushed onto the end of the `new-outside` queue.

When the `outside` queue is empty, it is swapped with `new-outside` and the main loop starts over. When the `inside` queue goes empty, the process is complete. Assuming `graph` is `NIL`, the results are returned.

If `*words*` is given by

```(defparameter *words* '("milk" "silk" "mill" "mails" "mills" "miles" "silly"
"smiles" "smile" "mile" "sill" "hilly" "shill"
"mall" "bill" "bull"))
```

and the base word is `bilk`, the results of a run are:

```SOCIAL> (social "bilk" *words* )
(("bilk" . 0) ("milk" . 1) ("silk" . 1) ("bill" . 1) ("mill" . 2) ("mile" . 2)
("sill" . 2) ("bull" . 2) ("mills" . 3) ("mall" . 3) ("miles" . 3)
("smile" . 3) ("silly" . 3) ("shill" . 3) ("smiles" . 4) ("hilly" . 4))
```

Thus "milk" is a direct friend of "bilk" and "mills" is a friend of a friend of a friend.

We can see the network more clearly by specifying `graph` as `t`. That causes `social` to call `sgraph` to output a `dot` program to produce a graph. Here's `sgraph`:

```(defun sgraph (edges)
"Output a dot program for the social graph of word in words."
(with-output-to-string (stream)
(format stream "graph G { ~%")
(mapc (lambda (e) (format stream "~a -- ~a~%" (car e) (cdr e))) edges)
(format stream "}~%")))
```

For our example, this produces the program

```graph G {
bilk -- milk
bilk -- silk
bilk -- bill
milk -- mill
milk -- mile
silk -- sill
bill -- bull
mill -- mills
mill -- mall
mile -- miles
mile -- smile
sill -- silly
sill -- shill
miles -- smiles
silly -- hilly
}
```

which produces the graph

This entry was posted in Programming and tagged . Bookmark the permalink.
• This method misses out some friendships, such as 'smile' and 'smiles' in the example above which have a Levenshtein distance of 1. Could the search be modified to generate a graph which identified all L=1 relationships in the data set, avoiding looping infinitely or generating duplicates?

• Or even worse, bill and mill. It wouldn't be hard to get them all by adding a little bookkeeping and delaying switching `outside` and `new-outside` but the original problem statement, as I understood it, didn't require this. Still, if you wanted a real social graph that's what you'd do.

• You're quite right - I hadn't read the question properly! Your original method is an elegant solution. Great blog, thanks :)