Tuesday, October 21, 2014

MITIE v0.3 Released: Now with Java and R APIs

We just made the next release of MITIE, a new DARPA-funded information extraction tool being created by our team at MIT.  This release is relatively minor and just adds APIs for Java and R.  The project page on GitHub explains how to get started using either of these APIs.

I want to take some time to explain how the Java API is implemented since, as I discovered while making it, there aren't clear instructions for doing this anywhere on the internet.  Hopefully this little tutorial will help you if you decide to make a similar Java binding to a C++ library.  To begin, let's think about the requirements for a good Java binding:
  • You should be able to compile it from source with a simple command
  • A user of your library should not need to edit or configure anything to compile the API
  • The compilation process should work on any platform
  • Writing JNI is awful so you shouldn't have to do that
These requirements pretty much lead you to Swig and CMake, which are both great tools.  However, finding out how to get CMake to work with Swig was painful, and that is what this blog post is about.  Happily, it's possible, and the result is a clean and easy to use mechanism for creating Java APIs.  In particular, you can compile MITIE's Swig/CMake based Java API using the usual CMake commands:
mkdir build
cd build
cmake ..
cmake --build . --config Release --target install
That creates a jar file and shared library file which together form the MITIE Java API.  Let's run through a little example to see how you can define new Java APIs.  Imagine you have created a simple C++ API that looks like this:
void printSomeString (const std::string& message);

class MyClass {
public:
    std::vector<std::string> getSomeStrings() const;
};
and you want to be able to use it from Java.  You just need to put this C++ API in a header file called swig_api.h and include some Swig commands that tell Swig what to call std::vector<std::string> in the generated Java API.  So the contents of swig_api.h would look like:
// Define some swig type maps that tell swig what to call various instantiations of
// std::vector.
#ifdef SWIG
%include "std_string.i"
%include "std_vector.i"
%template(StringVector)         std::vector<std::string>;
#endif

#include <string>
#include <vector>

void printSomeString (const std::string& message);

class MyClass {
public:
    std::vector<std::string> getSomeStrings() const;
};
The next step is to create a CMakeLists.txt file that tells CMake how to compile your API.  In our case, it would look like:

cmake_minimum_required (VERSION 2.8.4)

project(example)
set(java_package_name  edu.mit.ll.example)

# List the source files you want to compile into the Java API.  These contain 
# things like implementations of printSomeString() and whatever else you need.
set(source_files my_source.cpp another_source_file.cpp )

# List the folders that contain your header files
include_directories( . )

# List of libraries to link to.  For example, you might need to link to pthread
set(additional_link_libraries pthread)

# Tell CMake to put the compiled shared library and example.jar file into the
# same folder as this CMakeLists.txt file when the --target install option is
# executed. You can put any folder here, just give a path that is relative to
# the CMakeLists.txt file.
set(install_target_output_folder .)

include(cmake_swig_jni)
That's it.  Now you can compile your Java API using CMake and you will get an example.jar file and either example.dll or libexample.so, depending on your platform.  Then to use it you can write Java code like this:
import edu.mit.ll.example.*;
public class Example {
    public static void main(String[] args) {
        global.printSomeString("hello world!");

        MyClass obj = new MyClass();
        StringVector temp = obj.getSomeStrings();
        for (int i = 0; i < temp.size(); ++i)
            System.out.println(temp.get(i));
    }
}
and execute it via:
javac -classpath example.jar  Example.java
java -classpath example.jar;. -Djava.library.path=. Example

assuming example.jar and the shared library are in your current folder.  Note that Linux and OS X users need to use : as the classpath separator rather than the ; that Windows requires.  But that's it!  You just made a Java interface to your C++ library.  You might have noticed the include(cmake_swig_jni) statement though.  That file is a bunch of CMake magic I had to write to make all this work, but work it does, and on different platforms without trouble.  You can see a larger example of a Java to C++ binding using this same setup in MITIE's GitHub repo.
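If you'd rather not remember which separator and library file name go with which platform, the JVM can tell you.  Here is a minimal sketch that prints the run command from above for whatever platform it is executed on.  The class RunCommandSketch and its method names are made up for this illustration and are not part of MITIE or of the generated API; it only relies on the standard File.pathSeparator and System.mapLibraryName facilities.

```java
import java.io.File;

// Hypothetical helper for illustration only; not part of MITIE or the Swig output.
final class RunCommandSketch {

    // Joins classpath entries with the platform's separator
    // (";" on Windows, ":" on Linux and OS X).
    static String classpath(String... entries) {
        return String.join(File.pathSeparator, entries);
    }

    // Returns the file name the JVM searches for when asked to load the
    // native library with the given name, e.g. "example".
    static String nativeLibraryFileName(String libraryName) {
        return System.mapLibraryName(libraryName);
    }

    // Builds the "java ..." command line used in the post, with the
    // separator chosen for the current platform.
    static String runCommand(String jar, String libraryDir, String mainClass) {
        return "java -classpath " + classpath(jar, ".")
                + " -Djava.library.path=" + libraryDir
                + " " + mainClass;
    }

    public static void main(String[] args) {
        System.out.println("Native library file the JVM looks for: "
                + nativeLibraryFileName("example"));
        System.out.println(runCommand("example.jar", ".", "Example"));
    }
}
```

Running this on the machine where you built the API shows which shared library file needs to sit in the folder given by -Djava.library.path, along with the exact command to start Example there.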


8 comments:

beelz said...

Out of curiosity, why not use something like BridJ? It seems like you'd be able to access the native APIs without needing any separate compilation step.

geggo said...

@beelz Currently BridJ doesn't support STL types.

Davis King said...

We still need to compile MITIE itself and we are using CMake for that as well. So having Swig create a Java API as a part of the CMake script doesn't add any extra compiler run or extra step for the user of MITIE.

Swig also seems to be much more mature and doesn't include any runtime dependencies in the output so that's convenient as well.

Stefanelus said...
This comment has been removed by the author.
Stefanelus said...

Dear Davis,

In MITIE's binary relation extraction I noticed a type of relation called time.event.people_involved.

Does this mean that I can train a classifier between an entity and something like a number, date, time, etc.?

For example, if I have a product I would like to know its price. Is that possible?

I thought relations could only be between entities.

Best regards,
Stefan

Davis King said...

It only works between entities. However, you can train your own entity extractor which will label whatever you want as an entity.

Stefanelus said...

Dear Davis,

That is really cool. I also found the sample showing how to train your own entities. How many samples should I provide to get good results? Any suggestions on that?

Best regards,
Stefan

Davis King said...

How many samples you need depends on how much variability there is in the entity you want to detect. So you might only need 50 examples or you might need a whole lot (e.g. many thousands). The only way to know is to try.