03 May 2019

Shared Words Addendum

In the last post we looked at a Tube map connecting 173 of the stations of the London Underground according to the words shared between their names. There were 97 stations left over (with 95 names between them, because there are two "Hammersmith"s and two "Paddington"s): what should we do with those?

The straightforward solution would be to arrange all these stations on a "leftovers" line, in alphabetical order. But we can try something more elaborate: what if we arranged the stations according to how similar their names are?

Two things we need. First we need a procedure for working out how similar two stations' names are. Second we need a procedure for generating a made-up network from these similarity ratings.

Evaluating similarity

We could just look at every pair of stations and score how similar they sound, but this would be fairly subjective, as well as very time-consuming (with 95 station names there are 4465 pairs of names to rate). So instead I used SeatGeek's FuzzyWuzzy library, which algorithmically works out how closely two strings (like two station names) match.

Each pair of names is assigned a score between 0% and 100%. For example: FuzzyWuzzy assigns "Northfields" and "Southfields" an 82% match; "Kennington" and "Kenton" are assigned a 75% match; and "Bank" and "Whitechapel" are assigned a 13% match.

A different procedure would likely give us different results, and ultimately change the arrangement of stations on our map. But since this is all for fun, we need not fuss about that too much.

Generating a network

Now we need to turn these 4456 percentages into a network. We can use the following method:
  1. Start with one station in our network
  2. Find the closest similarity between a station inside our network and a station outside
  3. Create a link between the two stations found in step (2)
  4. Repeat steps (2) and (3) until every station is part of the network
For example, let's start with Balham in the network. According to FuzzyWuzzy, the most similar station name to "Balham" is "Amersham" with a 57% match, so we join Balham to Amersham.

Now the closest match to "Balham" is "Archway" (42%) and the closest match to "Amersham" is "Chesham" (67%). Since the "Amersham"-"Chesham" percentage is higher than "Balham"-"Archway", we join Amersham to Chesham.

We keep going until our last link is added: "Victoria" matches poorly with every other station, but its best match is 42% with "Blackfriars" (I guess because of the shared "c...ria"), so that's where it goes.

This procedure for growing the network is called Prim's algorithm. An interesting property of the algorithm is that the network ends up the same no matter which station we choose to start with.

The final map

Then all we have to do is draw the network with a little Tube map flair, and we end up with this:

No comments:

Post a Comment

Note: only a member of this blog may post a comment.