dlib C++ Library: MITIE: A completely free and state-of-the-art information extraction tool

Thursday, April 3, 2014

MITIE: A completely free and state-of-the-art information extraction tool

I work at an MIT lab and there are a lot of cool things about my job. In fact, I could go on all day about it, but in this post I want to talk about one thing in particular: we recently got funded by the DARPA XDATA program to make an open source natural language processing library focused on information extraction.

Why make such a thing when there are already open source libraries out there for this (e.g. OpenNLP, NLTK, Stanford IE, etc.)? Well, if you look around you quickly find out that everything that exists is either expensive, not state-of-the-art, or GPL licensed. If you want to use this kind of NLP tool in a non-GPL project then you are either out of luck, have to pay a lot of money, or have to settle for something of low quality. Well, not anymore! We just released the first version of our MIT Information Extraction (MITIE) library, which is built using state-of-the-art statistical machine learning tools.

At this point it has just a C API and an example program showing how to do English named entity recognition. Over the next few weeks we will be adding bindings for other languages like Python and Java. We will also be adding a lot more NLP tools in addition to named entity recognition, starting with relation extractors and part of speech taggers. But in the meantime you can use the C API or the streaming command line program. For example, if you had the following text in a file called sample_text.txt:

Meredith Vieira will become the first woman to host Olympics primetime coverage on her own when she fills on Friday night for the ailing Bob Costas, who is battling a continuing eye infection.

Then you can simply run:

cat sample_text.txt | ./ner_stream MITIE-models/ner_model.dat

And you get this as output:

[PERSON Meredith Vieira] will become the first woman to host [MISC Olympics] primetime coverage on her own when she fills on Friday night for the ailing [PERSON Bob Costas] , who is battling a continuing eye infection .
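
If you want to consume that bracketed output from another program, a small parser is enough. The following is only a sketch in Python, not part of MITIE: the names `ENTITY_PATTERN` and `parse_entities` are made up for illustration, and it assumes every entity is printed as `[TAG text]` with an uppercase tag and that the input text contains no literal square brackets of its own.

```python
import re

# Matches one bracketed entity such as "[PERSON Meredith Vieira]":
# an uppercase tag, a space, then the entity text up to the closing bracket.
# Assumption: the tagged text never contains "[" or "]" on its own.
ENTITY_PATTERN = re.compile(r"\[([A-Z]+) ([^\]]+)\]")


def parse_entities(tagged_line):
    """Return the (tag, text) pairs found in one line of tagged output."""
    return [
        (match.group(1), match.group(2).strip())
        for match in ENTITY_PATTERN.finditer(tagged_line)
    ]


sample_output = (
    "[PERSON Meredith Vieira] will become the first woman to host "
    "[MISC Olympics] primetime coverage on her own when she fills on "
    "Friday night for the ailing [PERSON Bob Costas] , who is battling "
    "a continuing eye infection ."
)

for tag, text in parse_entities(sample_output):
    print(tag, text)
```

Each pair holds the entity tag followed by the entity text, in the order the entities appear in the line.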

It's all up on GitHub, so if you want to try it out yourself just run these commands and off you go:

git clone https://github.com/mit-nlp/MITIE.git
cd MITIE
./fetch_submodules.sh
make examples
make MITIE-models
cat sample_text.txt | ./ner_stream MITIE-models/ner_model.dat

23 comments :

Marco Lui said...: Thanks for this post. I tried following your instructions, however it seems that in your minimal example, a `make ner_stream` step is missing.; April 17, 2014 at 3:14 AM
Davis King said...: Running make examples should compile ner_stream. Did you do that step or is that not working on your computer?; April 29, 2014 at 1:51 PM
Anonymous said...: Can't wait for the Java release! Thanks for sharing; May 9, 2014 at 1:17 AM
Asterisk14 said...: When I am trying to execute ./fetch_submodules.sh, it's failing complaining about git command proper usage. Can you please confirm this, I am not sure how ^M is present at the end of line: git submodule update --init^M; June 8, 2014 at 7:41 AM
Davis King said...: That file ends with a UNIX line ending. Normally ^M appears when there is also the byte for line endings used by Windows, but it's not in MITIE's git repo that way. Maybe you have set up git to convert line endings on files, are using Windows, and the git command gets upset about seeing Windows line ending patterns?; June 8, 2014 at 9:59 AM
Sudarshan said...: Any benchmarks that you have done against existing named entity annotation data sets like ACE or CoNLL ?; June 21, 2014 at 3:55 PM
Davis King said...: The model that comes with MITIE is trained based on the CoNLL 2003 shared task so we did that evaluation and got an F1 score of 88.1 which is quite good. https://github.com/mit-nlp/MITIE/wiki/Evaluation; June 21, 2014 at 6:27 PM
Anonymous said...: Fantastic! Very nicely done... Thank you and congratulations.; September 4, 2014 at 12:07 PM
Stefanelus said...: Hi Davis,

Is it possible to do stemming with dlib and extract symbols and stuff? I am trying to extract a feature descriptor from sentences and run it through a classifier from dlib.

Stefan; November 18, 2015 at 4:45 PM
Stefanelus said...: In mitie there is steam.c. I'll check that.; November 18, 2015 at 4:50 PM
lunakid said...: Stefan, what have you found?; May 4, 2016 at 6:08 PM
Unknown said...: Hi Davis,

Thanks for this great piece of software.

I'd like to run multiple instances of MITIE NamedEntityExtractor (from Java) in parallel but I can't afford to load as many model instances as extractor instances.

The documentation from mitie.h (1) seems to state that NamedEntityExtractor shall not be used concurrently. Is that right?

Then, is there at least any means to reuse a loaded model; ie, not to load a new model instance in memory (say english) for each new NamedEntityExtractor instance?

Thanks very much in advance,
Julien

(1) https://github.com/mit-nlp/MITIE/blob/master/mitielib/include/mitie.h; September 19, 2016 at 9:43 AM
Davis King said...: The documentation is correct. You can't share objects between threads without synchronization.; September 19, 2016 at 10:02 AM
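
A minimal sketch of what "sharing with synchronization" could look like, in Python and under loud assumptions: `StubExtractor` is a hypothetical stand-in for a loaded model and is not MITIE's API, and `LockedExtractor` is an illustrative wrapper name. The idea is that a single model instance is loaded once and a lock ensures only one thread touches it at a time, so memory use stays low but the extraction calls themselves run one after another.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubExtractor:
    """Hypothetical stand-in for a loaded NER model; not part of MITIE."""

    def extract(self, tokens):
        # Toy rule for illustration only: treat capitalized tokens as entities.
        return [token for token in tokens if token[:1].isupper()]


class LockedExtractor:
    """Wraps one extractor so that only one thread uses it at a time."""

    def __init__(self, extractor):
        self._extractor = extractor
        self._lock = threading.Lock()

    def extract(self, tokens):
        # The lock serializes access to the single shared model instance.
        with self._lock:
            return self._extractor.extract(tokens)


# One model instance shared by every worker thread.
shared = LockedExtractor(StubExtractor())
sentences = [["Bob", "runs"], ["she", "met", "Alice"]]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the input sentences.
    results = list(pool.map(shared.extract, sentences))

print(results)
```

This trades throughput for memory: the threads can still overlap their own pre- and post-processing, but they queue up for the extractor itself.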
Unknown said...: Do you think you decouple models from extractors so that one model can be reused by different NamedEntityExtractor instances as it is the case for OpenNLP?

Thanks!; September 19, 2016 at 10:13 AM
Davis King said...: Let's be clear, you are asking about multi-core processing. Yes, I'm sure it's possible to make MITIE take better advantage of multi-core CPUs. I'm not going to be doing that myself any time soon though. But pull requests are always welcome :); September 19, 2016 at 12:07 PM
Unknown said...: I think the question was clear from the beginning; so is the answer now :)

Thanks anyway; September 19, 2016 at 12:15 PM
Pradeep Sharma said...: This looks promising!; January 28, 2017 at 5:13 AM
Achyuta said...: @Julien Martin . Did you get any solution for this problem ?; February 28, 2018 at 2:18 AM
Achyuta said...: Hi, I am using MITIE NER, but I am now facing a performance issue. I am using a 5-core machine with 8 GB of RAM. I have 200 labeled sentences, but the problem is that it is taking more than 1 hour.

What mistake might I be making?

Thanks in advance.; February 28, 2018 at 2:28 AM
Davis King said...: You have to install the python header files. If you google your error you will get a whole lot of hits.; October 22, 2018 at 9:33 PM