Coffee projected on a map
To make this work online, I imported the word2vec data into Postgres. If you want to play with that yourself, you can find the code on GitHub.
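To give an idea of what "projecting a word on a map" could look like in code, here is a minimal sketch. It assumes each country is scored by the cosine similarity between the query word's vector and the vector for the country's name, which is the usual way to compare word2vec vectors but is my assumption about the scoring here. The names `cosine` and `score_countries` and the tiny three-dimensional vectors are made up for illustration; the real vectors are much larger and would come from the database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_countries(word_vec, country_vecs):
    """Return (country, score) pairs, highest similarity first."""
    scores = {name: cosine(word_vec, vec) for name, vec in country_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy vectors, NOT real word2vec data.
country_vecs = {
    "Brazil": [0.9, 0.1, 0.2],
    "Norway": [0.1, 0.8, 0.3],
    "Chad": [0.2, 0.2, 0.9],
}
coffee = [0.8, 0.2, 0.1]

ranking = score_countries(coffee, country_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The scores can then be mapped to a color scale per country. Doing the same computation inside the database rather than in Python would avoid pulling every country vector out for each query.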
Since the underlying model is trained on a Google News archive, some biases shine through. There are some countries that don't appear often in the news - Chad, the Central African Republic and the Republic of Congo (not to be confused with the Democratic Republic of Congo) spring to mind. This makes the vectors of those countries unstable. It takes only one article about a guy who went walking in Chad, and now Chad lights up for the word walk, even though it isn't particularly related.
The US has the opposite problem. American news talks about "the average American" or "in the US" even when the subjects discussed aren't particularly American at all. So the US tends to do well for day-to-day terms and maybe scores a bit low for international queries. I created a small spin-off, usmapof, that uses the names of the US states instead. Comparing the maps for "Germany", "Sweden" and "Norway" gives you an idea of where migrants from those countries ended up. Or if you want to know where hockey is popular:
Hockey lights up the north
It's fun to play with, but sometimes you see the limits of the model shine through. The data is somewhat old, so you can't use it well to illustrate current political events. Moreover, names of states are somewhat poor representations of the underlying entities. Washington usually does not mean the state. England makes New England light up for the US, but probably not because so many English settlers went there.
So I wonder if we can do better. What if instead of running a skip-gram algorithm over windows of words, we preprocessed the text into entities first? Then quite possibly the model would learn which entities have similar roles, rather than which words have similar roles. We might even want to somehow incorporate the roles of entities in sentences, which might allow the model to learn from a fragment like "Oil was found in Oklahoma" that oil is something that can be found and that Oklahoma is a place.
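A rough sketch of the first half of that idea: merge known multi-word entity names into single tokens before generating skip-gram pairs, so that "New England" no longer contributes to the vector for "England". This is a heavily simplified stand-in that assumes a fixed list of entity names; a real version would need a proper entity recognizer or linker. The function names `merge_entities` and `skipgram_pairs` and the example sentence are invented for illustration.

```python
def merge_entities(tokens, entities):
    """Greedily replace known entity names (tuples of words) with single tokens.

    Longer matches win, so ("New", "England") is preferred over ("England",).
    """
    max_len = max((len(e) for e in entities), default=1)
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in entities:
                out.append("_".join(candidate))
                i += len(candidate)
                break
        else:
            # No entity starts here: keep the plain word.
            out.append(tokens[i])
            i += 1
    return out


def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical entity list and sentence.
entities = {("New", "England"), ("England",)}
tokens = "She moved from England to New England".split()

merged = merge_entities(tokens, entities)
pairs = skipgram_pairs(merged, window=1)
print(merged)
```

The pairs would then be fed to the skip-gram trainer in place of the raw word pairs. This does nothing yet about the roles of entities in a sentence; that part would need a parser.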
Maybe I should try SyntaxNet out for this and see what happens.