Thursday, July 10, 2014

MITIE v0.2 Released: Now includes Python and C++ APIs for named entity recognition and binary relation extraction

A few months ago I posted about MITIE, the new DARPA-funded information extraction tool being created by our team at MIT. At the time it only provided English named entity recognition and sported a simple C API. Since then we have been busy adding new features, and today we released a new version of MITIE that adds a bunch of nice things, including:
  • Python and C++ APIs
  • Many example programs
  • 21 English binary relation extractors which identify pairs of entities with certain relations.  E.g. "PERSON BORN_IN PLACE"
  • Python, C, and C++ APIs for training your own named entity and binary relation extractors
You can get MITIE from its GitHub page. Then you can try out some of the new features in v0.2, one of which is binary relation extraction. This means you can ask MITIE whether two entities participate in some known relationship. For example, you can ask whether a piece of text is making the claim that a person was born in a location. That is, are the person and location entities participating in the "born in" relationship?

In particular, you could run MITIE over all the Wikipedia articles that mention Barack Obama and find each instance where someone made the claim that Barack Obama was born in some place.  I did this with MITIE and found the following:

  • 14 claims that Barack Obama was born in Hawaii
  • 5 claims that Barack Obama was born in the United States
  • 3 claims that Barack Obama was born in Kenya

Which is humorous. One of the Kenya claims comes from this sentence:
You can still find sources of that type which still assert that "Barack Obama was born in Kenya"
When you read it in the broader context of the article it's clear that it's not claiming he was born in Kenya.  So this is a good example of why it's important to aggregate over many relation instances when using a relation extractor.  By aggregating many examples we can get reasonably accurate outputs in the face of these kinds of mistakes.  
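As a rough illustration of that aggregation step, the sketch below tallies extracted claims and reports each candidate's share of the total. The `claims` list simply restates the counts listed above. The `aggregate_claims` helper is a hypothetical name used for illustration and is not part of MITIE.

```python
from collections import Counter


def aggregate_claims(claims):
    """Return (value, count, share) tuples, most frequent value first."""
    counts = Counter(claims)
    total = sum(counts.values())
    return [(value, n, n / float(total)) for value, n in counts.most_common()]


# Birthplace claims, restating the counts reported above.
claims = ["Hawaii"] * 14 + ["United States"] * 5 + ["Kenya"] * 3

ranked = aggregate_claims(claims)
for place, n, share in ranked:
    print("%-14s %2d claims (%.0f%%)" % (place, n, share * 100))

# The most frequently asserted value is the aggregated answer.
best_place = ranked[0][0]
```

A single mis-extracted sentence, like the Kenya example, moves these shares only slightly, which is why the aggregate is more trustworthy than any one extraction.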

However, what is even more entertaining than poking fun at American political dysfunction is MITIE's new API for creating your own entity and relation extractors. We worked to make this very easy to use. In particular, there are no parameters you need to mess with; everything is handled internally by MITIE. All you, the user, need to do is give example data showing what you want MITIE to learn to detect, and it takes care of the rest. Moreover, in the spirit of easy-to-use APIs, we also added a new Python API that allows you to exercise all the functionality in MITIE via Python. As a little example, here is how you use it to find named entities:
from mitie import *
ner = named_entity_extractor('MITIE-models/english/ner_model.dat')
tokens = tokenize("The MIT Information Extraction (MITIE) tool was created \
                   by Davis King, Michael Yee, and Wade Shen at the \
                   Massachusetts Institute of Technology.")
print tokens
This loads in the English named entity recognizer model that comes with MITIE and then tokenizes the sentence. So the print statement produces:
['The', 'MIT', 'Information', 'Extraction', '(', 'MITIE', ')', 'tool', 'was', 'created', 'by', 'Davis', 'King', ',', 'Michael', 'Yee', ',', 'and', 'Wade', 'Shen', 'at', 'the', 'Massachusetts', 'Institute', 'of', 'Technology', '.']
Then to find the named entities we simply do:
entities = ner.extract_entities(tokens)
print "Number of entities detected:", len(entities)
print "Entities found:", entities
Which prints:
Number of entities detected: 6
Entities found: [(xrange(1, 4), 'ORGANIZATION'), (xrange(5, 6), 'ORGANIZATION'), (xrange(11, 13), 'PERSON'), (xrange(14, 16), 'PERSON'), (xrange(18, 20), 'PERSON'), (xrange(22, 26), 'ORGANIZATION')]
So the output is just a list of ranges and labels. Each range gives the indices of the tokens that are part of that entity. To print these out in a nice list we would just do:
for e in entities:
    token_range = e[0]  # indices of the tokens in this entity
    tag = e[1]          # entity label, e.g. 'PERSON'
    entity_text = " ".join(tokens[i] for i in token_range)
    print tag + ": " + entity_text
Which prints:
ORGANIZATION: MIT Information Extraction
ORGANIZATION: MITIE
PERSON: Davis King
PERSON: Michael Yee
PERSON: Wade Shen
ORGANIZATION: Massachusetts Institute of Technology
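
The same entity ranges can be fed to a binary relation detector, such as the "born in" check described earlier. The sketch below is only an outline. It assumes that `binary_relation_detector` and `ner.extract_binary_relation` behave as they do in the example programs that ship with MITIE, and that a score above zero means the relation is asserted. The helper name `find_relations`, its `threshold` parameter, and the model path in the usage comment are illustrative assumptions, so check them against your own MITIE checkout.

```python
def find_relations(tokens, entities, ner, rel_detector, threshold=0.0):
    """Score ordered pairs of adjacent entities and keep those above threshold.

    ner and rel_detector are passed in, so this function loads no models itself.
    """
    ranges = [e[0] for e in entities]
    # Adjacent entity pairs in both orders, since a relation is directional.
    pairs = [(ranges[i], ranges[i + 1]) for i in range(len(ranges) - 1)]
    pairs += [(b, a) for (a, b) in pairs]

    found = []
    for first, second in pairs:
        # Assumed MITIE calls: build the relation object, then score it.
        rel = ner.extract_binary_relation(tokens, first, second)
        score = rel_detector(rel)
        if score > threshold:
            first_text = " ".join(tokens[i] for i in first)
            second_text = " ".join(tokens[i] for i in second)
            found.append((first_text, second_text, score))
    return found


# Assumed usage with a detector from the MITIE-models download (path is an assumption):
# from mitie import binary_relation_detector
# rel_detector = binary_relation_detector(
#     'MITIE-models/english/binary_relations/'
#     'rel_classifier_people.person.place_of_birth.svm')
# for person, place, score in find_relations(tokens, entities, ner, rel_detector):
#     print(person + " BORN_IN " + place)
```

Only neighboring entities are paired here to keep the number of candidates small. Pairing every entity with every other one is also possible but produces many more candidates to score.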

12 comments:

Unknown said...

Hi Davis,

Any way to make dlib use nvblas?

Davis King said...

Yes, dlib is capable of using any optimized BLAS or LAPACK libraries that are installed on your system. To do this you define the DLIB_USE_BLAS and/or DLIB_USE_LAPACK preprocessor macros and then link your program with whatever BLAS or LAPACK libraries you have.

Rain Maker said...

Hi Davis,
Your dlib is awesome at predicting face pose. However, I want to align a non-frontal face pose to a frontal face pose. I asked on Stack Overflow: http://stackoverflow.com/questions/36590516/how-to-get-3d-coordinate-axes-of-head-pose-estimation-in-dlib-c. But to use the code provided by ZdaR, I need to know the 3D positions of the nose, eyes, etc. in your model "shape_predictor_68_face_landmarks.dat", and I don't know how to read or get data from this file. Do you have any suggestions?

Davis King said...

Take a look at this project https://github.com/chili-epfl/attention-tracker

Unknown said...

Hello Davis,

I am experimenting with the dlib library and have found it very useful. Thanks for making it open source.

I have a question. I compiled the example code on both CentOS and Windows using CMake. However, I am having difficulty passing an image folder through the command prompt on Windows.
./face_detection_ex faces/*.jpg --this command worked on CentOS.
However, if I give
face_detection_ex.exe C:\dlib-18.18\examples\faces\*.jpg ---then an exception is thrown. I understand why this is happening: it's not getting the image names.
My question now is how to read images from a folder through the command prompt in a Windows environment, just like on CentOS (Linux), as mentioned in the example code of face_detection_ex.

Thanking You.

Davis King said...

You have to give an explicit filename on Windows because the Windows command prompt doesn't expand wildcards.

Unknown said...

Hello,

Okay, are there any alternatives? Is it possible to read a complete image folder inside the code and then pass the images one by one to the load_image function, just like in OpenCV?

Thanking You.

Unknown said...

Hi, you have given an example of code for entity extraction and labeling. Can you please provide the same for finding all the instances of the Obama binary relationship example? Thanks.

Davis King said...

There is already a relation extraction example program here: https://github.com/mit-nlp/MITIE/blob/master/examples/python/train_relation_extraction.py

Unknown said...

Hi Davis, MITIE is awesome. Can you please tell me which algorithms it uses for NER and binary relation extraction? Is there a paper explaining the models and training data used?

Davis King said...

There is no paper. It uses eigenword embeddings and a structural SVM to do the NER segmentation. If you want details the code is well commented and most of MITIE is a thin layer on top of dlib, which is documented in great detail.