Douwe Osinga's Blog: Building a Reverse Image Search Engine using a Pretrained Image Classifier

Thursday, January 26, 2017

In the Olden Days, say more than 10 years ago, building Image Search was really quite hard (see xkcd). In fact, when Alta Visa first came out with a search engine for images, they couldn't really do it. What they did was return the image that had text around it that most matched your query. It's a wonder that that worked at all, but it had to do for years and years.
Why we need Reverse Image Search: find more cat pictures (from: wikipedia)
How things have changed. These days Neural Networks have no problem detecting the actual content of pictures, in some categories outperforming their human masters. An interesting development here is reverse image search - supply a search engine with an image and it will tell you where else this or similar images occur on the web. Most articles on the web describing approaches on how to do this focus on things like Perceptual Hashing. While I am sure that is a good way, it struck me that there is a much simpler way.

Embeddings! Algorithms like Word2Vec train a neural network for a classification task, but they don't use the learned classification directly. Instead they use the layer just before the classification as a representation of the word. Similarly, we can use a pre-trained image classifier and run it on a collection of images, but rather than using the final layer to label the result, we get the layer before that and use that as a vector representation of the image. Similar images will have a similar vector representation. So finding similar images becomes just a nearest neighbor search.

As with a lot of things like this, getting the data to run the algorithm on is more work than getting the algorithm to run. Where do we get a set of representative images from? The images from the Wikipedia are a good start, but we might not want all of them. Most articles are about specific instances of things - for a reverse image search demo, classes of things are more interesting. We're interested in cats, not specific cats.

Luckily, Wikidata annotates its records with a 'is instance of' property. If you have imported a Wikidata snapshot into Postgres, then getting the wikipedia_ids for all values for the instance-of property is a simple SQL statement:
select properties->>'instance of' as thing, count(*) as c 
from wikidata group by thing

For some of these, Wikidata also provides us with a canonical image. For others we have to fetch the wikipedia page and parse the wikicode. We're just going to get the first image that appears on the page, nothing fancy. After an hour of crawling, we end up with a set of roughly 7 thousand images.

SkLearn provides us with an k-nearest-neighbor algorithm implementation and we're off to the races. We can spin-up a flask based server that accepts an image as a POST request, feeds that image into our pre-trained classifier. From that we'll get the vector representing that image. We then feed that vector into the nearest neighbor model and out fall the most similar images. You can see a working demo here

It mostly works well. If you feed it a cat, it will return pictures of cats, the definition of success on the Internet. On mobile, you can directly upload a picture from your phone's camera and that seems to go ok, too. The biggest limitation I've come across so far is that the algorithm is bad at estimating how good its guesses are. So if there aren't any suitable pictures in the training set, it will return the one that it thinks is the closest match, but to the human eye it seems fairly unrelated.

As always, you can find the code on Github.