door in the open

Searching for Words in Django

2015-01-25

This is a follow-up on Searching for Words.

The cute BK tree algorithm from the last post was not to live in isolation from the rest of the internet world. It was destined to be but a module of a bigger django machine powering a nice well-intentioned website.

The sweet BK tree can interact with the django app it lives in through two functions only: one will build the tree and save it somewhere for the other one to search it. Views will call the latter, while the former could be invoked by either management commands, signal hooks, or both.

As the rest of the app does not care how and where is the tree stored (or, indeed, whether it is stored at all, or re-generated each time a search is performed — which would be fine with small data), we will make interaction with storage internal to the comely BK tree module.

Where should we store the tree though?
It could be either in the filesystem, in local memory (makes sense if the data is small and your machine has spare memory), or distributed in other machines' memories (talking about you, memcached). The decision is up to you. However, it turns out that all these methods are viable backends of django's cache system, so in our code we can use the latter and let django carry out the implementation. Look at our search.py module below:

from django.core.cache import cache from app.models import Word CACHE_KEY = 'app_search_index' class Node: def __init__(self, token): self.token = token self.edges = {} def append(self, token): d = get_edit_distance(self.token, token) if d in self.edges: self.edges[d].append(token) else: self.edges[d] = Node(token) def as_dict(self): d = {0: self.token} for key, edge in self.edges.items(): d[key] = edge.as_dict() return d def make_index(): """ Creates the search index and saves it in the cache. """ all_words = Word.objects.all() root = Node(all_words[0].word) for word in all_words[1:]: root.append(word.word) tree = root.as_dict() cache.set(CACHE_KEY, tree, None) return tree class Search: def __init__(self, tree, query, tol=2): self.query = query self.hoard = {} self.tol = tol self.run(tree) def run(self, subtree): """ Recursive running through the nodes of interest. """ d_root = get_edit_distance(self.query, subtree[0]) if d_root >= -self.tol and d_root <= self.tol: if d_root not in self.hoard: self.hoard[d_root] = [] self.hoard[d_root].append(subtree[0]) for i in range(d_root-self.tol, d_root+self.tol+1): if i and i in subtree: self.run(subtree[i]) def search(query): """ Returns an ordered list of words matching the query given. To be used in the views. """ tree = cache.get(CACHE_KEY) if tree is None: tree = make_index() s = Search(tree, query) word_list = [] for key in sorted(s.hoard): for token in s.hoard[key]: word_list.append(token) return word_list def get_edit_distance(abscissa, ordinate): """ Returns the Levenshtein distance between the two strings given. """ matrix = [] for y in range(0, len(ordinate)+1): row = [] for x in range(0, len(abscissa)+1): if x == 0 and y == 0: row.append(0) continue if y-1 >= 0: top = matrix[y-1][x] + 1 else: top = 666 if x-1 >= 0: left = row[x-1] + 1 else: left = 666 if x-1 >= 0 and y-1 >= 0: diagonal = matrix[y-1][x-1] if abscissa[x-1] != ordinate[y-1]: diagonal += 1 else: diagonal = 666 row.append(min(top, left, diagonal)) matrix.append(row) return matrix[y][x]

Now we are ready to use search() in our views and make_index() in a signal hook or management command. Here is an implementation of the former (living in signals.py):

from django.db.models.signals import post_save, post_delete from django.dispatch import receiver from app.models import Word from app.search import make_index import logging @receiver([post_save, post_delete], sender=Word, dispatch_uid='app_post_save') def hook_post_save_word(sender, **kwargs): try: make_index() except Exception as error: logger = logging.getLogger('app') logger.exception(error)

Mind the pitfall!
Running unit tests must be very likely to invoke re-building of the search index. Guess what will happen if you do something like python manage.py test on your server machine. Thus, we have to make sure that a different cache system is used for testing purposes. Here is how our settings might look like:

CACHES = { 'default': { 'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache', 'LOCATION': '/tmp/django_cache' } } if 'test' in sys.argv: CACHES = { 'default': { 'BACKEND': 'django.core.cache.backends.locmem.LocMemCache', 'LOCATION': 'stela' } }