door in the open

Searching for Words

2015-01-16

Yes, an appropriate name for my very first tech post.

Long hours ago, it turned out that I needed a very simple search. Very simple indeed, for the tokens to be searched are words from a list. The first thing that comes to mind is to do a simple database lookup; and if you have been tempted to try that you know that it does not feel right, mostly due to lack of both suggestions and relevant ordering of results.

First way forward: implement something like whoosh + haystack (I am a djangonaut, proudly said I). Second way forward: use the opportunity to play around with trees; and avoid the sense of guilt for adding two python libraries just for that simple search.

This article is about growing and using BK (Burkhard and Keller) trees for word searches. Here is one such tree.

Preparation instructions
We pick the first word and make it the tree root; it can be any word. We pick the next word, and make it a leaf, with the edge between both words indicating the edit (aka Levenshtein) distance. In our case, pavel is the root, ste is the leaf and the edit distance between these is 4. We pick the next word (stela) and find out that its edit distance from the root is 4 again; however, there already is a node at 4, so we have to attach the new leaf somewhere behind the ste node. To do that, we move onto the subtree which has ste as the root, and repeat: the distance between ste and stela is 2, and as there is nothing at distance 2 away from the local root we attach the leaf there. We pick the last word and it finds its place at distance 5 from the root.

class Node: def __init__(self, token): self.token = token self.edges = {} def append(self, token): d = get_edit_distance(self.token, token) if d in self.edges: self.edges[d].append(token) else: self.edges[d] = Node(token) def as_dict(self): d = {0: self.token} for key, edge in self.edges.items(): d[key] = edge.as_dict() return d tree = Node('pavel') for token in ['ste', 'stela', 'stella']: tree.append(token) tree_dict = tree.as_dict()

Now we can use the tree to search for all the words within some edit distance from the query. For example, we want stel and we want the results to be within edit distance of 2 from stel. As the edit distance between stel and the root pavel is 3, we will walk all the nodes at edit distances between 3-2 and 3+2, in our case the nodes ste and stella. The former has more nodes attached to it, so we repeat: the edit distance between the query stel and the node ste is 1, so we will traverse all the nodes at edit distance between 1-2 and 1+2 from that node, etc. Of course, along the way we collect all the words which are within edit distance of 2 from the query.

class Search: def __init__(self, tree, query, tol=2): self.query = query self.hoard = {} self.tol = tol self.run(tree) def run(self, subtree): d_root = get_edit_distance(self.query, subtree[0]) if d_root >= -self.tol and d_root <= self.tol: if d_root not in self.hoard: self.hoard[d_root] = [] self.hoard[d_root].append(subtree[0]) for i in range(d_root-self.tol, d_root+self.tol+1): if i and i in subtree: self.run(subtree[i]) s = Search(tree_dict, 'stel') print(s.hoard)

There is only one missing function.

def get_edit_distance(abscissa, ordinate): matrix = [] for y in range(0, len(ordinate)+1): row = [] for x in range(0, len(abscissa)+1): if x == 0 and y == 0: row.append(0) continue if y-1 >= 0: top = matrix[y-1][x] + 1 else: top = 666 if x-1 >= 0: left = row[x-1] + 1 else: left = 666 if x-1 >= 0 and y-1 >= 0: diagonal = matrix[y-1][x-1] if abscissa[x-1] != ordinate[y-1]: diagonal += 1 else: diagonal = 666 row.append(min(top, left, diagonal)) matrix.append(row) return matrix[y][x]

What is that tree good for?
Whenever we have a big list of items, be it a dictionary or a spacecraft catalogue, we can use the BK algorithm to search through these (1) quickly, (2) allowing for alternative (mis)spellings, and (3) providing query results ordered by decreasing relevance. The only condition is that we have reliable metrics for comparing items of the list with each other and with the query.