Thursday, July 10, 2014

MITIE v0.2 Released: Now includes Python and C++ APIs for named entity recognition and binary relation extraction

A few months ago I posted about MITIE, the new DARPA-funded information extraction tool being created by our team at MIT. At the time it only provided English named entity recognition and sported a simple C API. Since then we have been busy adding new features, and today we released a new version of MITIE which adds a bunch of nice things, including:
  • Python and C++ APIs
  • Many example programs
  • 21 English binary relation extractors which identify pairs of entities with certain relations.  E.g. "PERSON BORN_IN PLACE"
  • Python, C, and C++ APIs for training your own named entity and binary relation extractors
You can get MITIE from its GitHub page. Then you can try out some of the new features in v0.2, one of which is binary relation extraction. This means you can ask MITIE whether two entities participate in some known relationship. For example, you can ask whether a piece of text claims that a person was born in a location, i.e. whether the person and location entities participate in the "born in" relationship.

In particular, you could run MITIE over all the Wikipedia articles that mention Barack Obama and find each instance where someone made the claim that Barack Obama was born in some place.  I did this with MITIE and found the following:

  • 14 claims that Barack Obama was born in Hawaii
  • 5 claims that Barack Obama was born in the United States
  • 3 claims that Barack Obama was born in Kenya

Which is humorous. One of the Kenya claims comes from this sentence:
You can still find sources of that type which still assert that "Barack Obama was born in Kenya"
When you read it in the broader context of the article, it's clear that the article is not claiming he was born in Kenya. So this is a good example of why it's important to aggregate over many relation instances when using a relation extractor. By aggregating many examples we can get reasonably accurate outputs in the face of these kinds of mistakes.
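To make the aggregation idea concrete, here is a minimal sketch that tallies extracted claims and ranks the candidate birthplaces by how often they were asserted. It does not call MITIE at all: the `claims` list is hypothetical input built to mirror the Wikipedia counts reported above, and `aggregate_claims` is an illustrative helper, not part of MITIE's API.

```python
from collections import Counter

# Hypothetical input: one birthplace string per extracted "born in" claim.
# The counts mirror the Wikipedia tallies reported above.
claims = ["Hawaii"] * 14 + ["United States"] * 5 + ["Kenya"] * 3


def aggregate_claims(places):
    """Return (place, share of all claims) pairs, most frequent first."""
    counts = Counter(places)
    total = sum(counts.values())
    return [(place, count / float(total))
            for place, count in counts.most_common()]


ranked = aggregate_claims(claims)
best_place, best_share = ranked[0]
print("Most supported birthplace: %s (%.0f%% of claims)"
      % (best_place, 100 * best_share))
```

With these counts the top-ranked answer is Hawaii, with 14 of 22 claims, so a handful of misread sentences does not change the outcome. A real pipeline would probably also want to reconcile compatible answers such as "Hawaii" and "the United States", which this sketch treats as distinct.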

However, what is even more entertaining than poking fun at American political dysfunction is MITIE's new API for creating your own entity and relation extractors. We worked to make this very easy to use. In particular, there are no parameters you need to mess with; everything is handled internally by MITIE. All you, the user, need to do is give example data showing what you want MITIE to learn to detect, and it takes care of the rest. Moreover, in the spirit of easy-to-use APIs, we also added a new Python API that allows you to exercise all the functionality in MITIE via Python. As a little example, here is how you use it to find named entities:
from mitie import *
ner = named_entity_extractor('MITIE-models/english/ner_model.dat')
tokens = tokenize("The MIT Information Extraction (MITIE) tool was created \
                   by Davis King, Michael Yee, and Wade Shen at the \
                   Massachusetts Institute of Technology.")
print tokens
This loads in the English named entity recognizer model that comes with MITIE and then tokenizes the sentence. So the print statement produces:
['The', 'MIT', 'Information', 'Extraction', '(', 'MITIE', ')', 'tool', 'was', 'created', 'by', 'Davis', 'King', ',', 'Michael', 'Yee', ',', 'and', 'Wade', 'Shen', 'at', 'the', 'Massachusetts', 'Institute', 'of', 'Technology', '.']
Then to find the named entities we simply do:
entities = ner.extract_entities(tokens)
print "Number of entities detected:", len(entities)
print "Entities found:", entities
Which prints:
Number of entities detected: 6
Entities found: [(xrange(1, 4), 'ORGANIZATION'), (xrange(5, 6), 'ORGANIZATION'), (xrange(11, 13), 'PERSON'), (xrange(14, 16), 'PERSON'), (xrange(18, 20), 'PERSON'), (xrange(22, 26), 'ORGANIZATION')]
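So the output is just a list of ranges and labels. Each range indicates which tokens are part of that entity. For example, xrange(1, 4) covers tokens 1, 2, and 3, which are 'MIT', 'Information', and 'Extraction'. To print these out in a nice list we would just do: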
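for e in entities:
    # Avoid naming this variable "range", which would shadow the built-in.
    token_range = e[0]  # indices of the tokens that make up this entity
    tag = e[1]
    entity_text = " ".join(tokens[i] for i in token_range)
    print tag + ": " + entity_text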
Which prints:
ORGANIZATION: MIT Information Extraction
ORGANIZATION: MITIE
PERSON: Davis King
PERSON: Michael Yee
PERSON: Wade Shen
ORGANIZATION: Massachusetts Institute of Technology
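If you would rather collect the entities by type than print them one per line, the same (range, tag) pairs can be grouped into a dictionary. The sketch below is one way to do that. So that it runs without MITIE or its model files, `tokens` and `entities` are hard-coded from the output shown above (using `range` in place of `xrange`), and `group_entities` is a hypothetical helper rather than something the MITIE API provides.

```python
from collections import defaultdict

# Tokens and entity ranges copied from the output shown above.
tokens = ['The', 'MIT', 'Information', 'Extraction', '(', 'MITIE', ')',
          'tool', 'was', 'created', 'by', 'Davis', 'King', ',', 'Michael',
          'Yee', ',', 'and', 'Wade', 'Shen', 'at', 'the', 'Massachusetts',
          'Institute', 'of', 'Technology', '.']
entities = [(range(1, 4), 'ORGANIZATION'), (range(5, 6), 'ORGANIZATION'),
            (range(11, 13), 'PERSON'), (range(14, 16), 'PERSON'),
            (range(18, 20), 'PERSON'), (range(22, 26), 'ORGANIZATION')]


def group_entities(tokens, entities):
    """Map each tag to the list of entity strings found with that tag."""
    grouped = defaultdict(list)
    for token_range, tag in entities:
        # Each range is half-open: it stops one before its end index.
        grouped[tag].append(" ".join(tokens[i] for i in token_range))
    return dict(grouped)


grouped = group_entities(tokens, entities)
for tag in sorted(grouped):
    print(tag + ": " + ", ".join(grouped[tag]))
```

Within each tag the entities keep the order in which they appear in the text, since the sketch walks the list in the order shown above.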