dlib C++ Library, by Davis King<br />
<br />
<b>Automatic Learning Rate Scheduling That Really Works</b> (2018-02-13)<br />
<br />
Training deep learning models can be a pain. In particular, there is this perception that one of the reasons it's a pain is because you have to fiddle with learning rates. For example, arguably the most popular strategy for setting learning rates looks like this:<br />
<ol>
<li>Run vanilla stochastic gradient descent with momentum and a fixed learning rate</li>
<li>Wait for the loss to stop improving</li>
<li>Reduce the learning rate</li>
<li>Go back to step 1 or stop if the learning rate is really small</li>
</ol>
<div>
Many papers reporting state-of-the-art results do this. There have been a lot of other methods proposed, like Adam, but I've always found the above procedure to work best. <a href="https://arxiv.org/abs/1705.08292">This is a common finding</a>. The only fiddly part of this procedure is the "wait for the loss to stop improving" step. A lot of people just eyeball a plot of the loss and manually intervene when it looks like it's flattened out. Or worse, they pick a certain number of iterations ahead of time and blindly stop when that limit is reached. Both of these ways of deciding when to reduce the learning rate suck. </div>
<br />
Fortunately, there is a simple method from classical statistics we can use to decide if the loss is still improving, and thus when to reduce the learning rate. With this method it's trivial to fully automate the above procedure. In fact, it's what I've used to train all the public DNN models in dlib over the last few years: e.g. <a href="http://blog.dlib.net/2016/10/easily-create-high-quality-object.html">face detection</a>, <a href="http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html">face recognition</a>, <a href="http://blog.dlib.net/2017/08/vehicle-detection-with-dlib-195_27.html">vehicle detection</a>, and <a href="http://blog.dlib.net/2016/06/a-clean-c11-deep-learning-api.html">imagenet classification</a>. It's the default solving strategy used by <a href="http://dlib.net/ml.html#dnn_trainer">dlib's DNN solver</a>. The rest of this blog post explains how it works.<br />
<br />
Fundamentally, what we need is a method that takes a noisy time series of $n$ loss values, $Y=\{y_0,y_1,y_2,\dots,y_{n-1}\}$, and tells us if the time series is trending down or not. To do this, we model the time series as a line corrupted by Gaussian noise:<br />
\[<br />
\newcommand{\N} {\mathcal{N} } y_i = m\times i + b + \epsilon<br />
\] Here, $m$ and $b$ are the unknown true slope and intercept parameters of the line, and $\epsilon$ is a Gaussian noise term with mean 0 and variance $\sigma^2$. Let's also define the function $\text{slope}(Y)$ that takes in a time series, performs <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">OLS</a>, and outputs the OLS estimate of $m$, the slope of the line. You can then ask the following question: what is the probability that a time series sampled from our noisy line model will have a negative slope according to OLS? That is, what is the value of<br />
\[<br />
P(\text{slope}(Y) < 0)<br />
\]If we could compute an estimate of $P(\text{slope}(Y)<0)$ we could use it to test if the loss is still decreasing. Fortunately, computing the above quantity turns out to be easy. In fact, $\text{slope}(Y)$ is a Gaussian random variable with this distribution:<br />
\[<br />
\text{slope}(Y) \sim \N\left(m, \frac{12 \sigma^2}{n^3-n}\right)<br />
\]We don't know the true values of $m$ and $\sigma^2$, but they are easily estimated from data. We can obviously use $\text{slope}(Y)$ to estimate $m$. As for $\sigma^2$, it's customary to estimate it like this:<br />
\[ \hat\sigma^2 = \frac{1}{n-2} \sum_{i=0}^{n-1} (y_i - \hat y_i)^2 \] which gives an unbiased estimate of the true $\sigma^2$. Here $y_i - \hat y_i$ is the difference between the observed time series value at time $i$ and the value predicted by the OLS fitted line at time $i$. I should point out that none of this is new stuff; in fact, these properties of OLS are discussed in detail on the <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares#Finite_sample_properties">Wikipedia page about OLS</a>.<br />
<br />
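To make this concrete, here is a minimal Python sketch of the whole test (the function name is mine, not dlib's API): it computes the OLS estimates of $m$ and $\sigma^2$ exactly as described above, then evaluates the Gaussian CDF at 0.<br />
<pre>import math

def probability_slope_below_zero(Y):
    # P(slope(Y) < 0) under the noisy line model: the OLS slope estimate
    # is Gaussian with mean m and variance 12*sigma^2/(n^3 - n).
    n = len(Y)
    xbar = (n - 1) / 2.0
    ybar = sum(Y) / n
    sxx = sum((i - xbar)**2 for i in range(n))   # equals (n^3 - n)/12
    sxy = sum((i - xbar) * (y - ybar) for i, y in enumerate(Y))
    m_hat = sxy / sxx                            # OLS slope estimate
    b_hat = ybar - m_hat * xbar                  # OLS intercept estimate
    # Unbiased noise variance estimate (note the n-2 divisor).
    sigma2 = sum((y - (m_hat*i + b_hat))**2 for i, y in enumerate(Y)) / (n - 2)
    slope_std = math.sqrt(12.0 * sigma2 / (n**3 - n))
    # CDF of N(m_hat, slope_std^2) evaluated at 0.
    return 0.5 * (1 + math.erf(-m_hat / (slope_std * math.sqrt(2))))
</pre>
<br />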
So let's recap. We need a method to decide if the loss is trending downward or not. I'm suggesting that you use $P(\text{slope}(Y) < 0)$, the probability that a line fit to your loss curve will have negative slope. Moreover, as discussed above, this probability is easy to compute since it's just a question about a simple Gaussian variable and the two parameters of the Gaussian variable are given by a straightforward application of OLS.<br />
<br />
You should also note that the variance of $\text{slope}(Y)$ decays at the very quick rate of $O(1/n^3)$, where $n$ is the number of loss samples. So it becomes very accurate as the length of the time series grows. To illustrate just how accurate this is, let's look at some examples. The figure below shows four different time series plots, each consisting of $n=4000$ points. Each plot is a draw from our noisy line model with parameters: $b=0$, $\sigma^2=1$, and $m \in \{0.001, 0.00004, -0.00004, -0.001\}$. For each of these noisy plots I've computed $P(\text{slope}(Y) < 0)$ and included it in the title.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGn1z2KcdTvHFDrHIWPnjr3zc27s3P3Gt77xjQi9qsurll6sa027asm_Gu4HO47-VvKVGwN19xVu7Hh9asHNX7sgYOxQVW0KmDS7HHutU17VMt4Sll9fnuNNEsLLUnBxx2lijPypQ2i8c/s1600/noisy_lines2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGn1z2KcdTvHFDrHIWPnjr3zc27s3P3Gt77xjQi9qsurll6sa027asm_Gu4HO47-VvKVGwN19xVu7Hh9asHNX7sgYOxQVW0KmDS7HHutU17VMt4Sll9fnuNNEsLLUnBxx2lijPypQ2i8c/s1600/noisy_lines2.png" style="all: initial;" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
From looking at these plots it should be obvious that $P(\text{slope}(Y) < 0)$ is quite good at detecting the slope. In particular, I doubt you can tell the difference between the two middle plots (the ones with slopes -0.00004 and 0.00004). But as you can see, the test statistic I'm suggesting, $P(\text{slope}(Y) < 0)$, has no trouble at all correctly identifying one as sloping up and the other as sloping down.<br />
<br />
I find that a nice way to parameterize this in actual code is to count the number of mini-batches that executed while $P(\text{slope}(Y) < 0) < 0.51$. That is, find out how many of the most recent loss values you can look back over before finding evidence that the loss is decreasing. To be very clear, this bit of pseudo-code implements the idea:<br />
<div class="highlight">
<pre>def count_steps_without_decrease(Y):
    steps_without_decrease = 0
    n = len(Y)
    for i in reversed(range(n)):
        if P(slope(Y[i:n]) < 0) < 0.51:
            steps_without_decrease = n - i
    return steps_without_decrease
</pre>
</div>
You can then use a rule like: "if the number of steps without decrease reaches 1000, lower the learning rate by 10x". However, there is one more issue that needs to be addressed: loss curves sometimes have really large transient spikes, where, for one reason or another (e.g. a bad mini-batch), the loss suddenly becomes huge for a moment. Not all models or datasets have this problem during training, but some do. In these cases, count_steps_without_decrease() might erroneously return a very large value. You can deal with this problem by discarding the top 10% of loss values inside count_steps_without_decrease(). This makes the entire procedure robust to these noisy outliers. Note, however, that the final test you would want to use is:<br />
<pre><span class="nf">count_steps_without_decrease(Y) > threshold and </span>count_steps_without_decrease_robust(Y) > threshold</pre>
That is, perform the check with and without outlier discarding. You need both checks because the 10% largest loss values might have occurred at the very beginning of Y. For example, maybe you are waiting for 1000 (i.e. threshold=1000) mini-batches to execute without showing evidence of the loss going down. And maybe the first 100 all showed a dropping loss while the last 900 were flat. The check that discarded the top 10% would erroneously indicate that the loss was NOT dropping. So you want to perform both checks and if both agree that the loss isn't dropping then you can be confident it's not dropping.<br />
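Here is a sketch of that robust variant and of the combined rule (the names mirror dlib's routines but this is not dlib's actual implementation; losses, threshold, and learning_rate are assumed to be defined by your training loop):<br />
<pre>import math

def count_steps_without_decrease_robust(Y, discard=0.10):
    # Drop the largest 10% of loss values so transient spikes can't mask
    # a downward trend, keep the rest in their original order, and then
    # apply the same test as count_steps_without_decrease().
    if not Y:
        return 0
    cutoff = sorted(Y)[max(0, int(math.ceil(len(Y) * (1 - discard))) - 1)]
    return count_steps_without_decrease([y for y in Y if y <= cutoff])

# Reduce the learning rate only when both tests agree:
if (count_steps_without_decrease(losses) > threshold and
        count_steps_without_decrease_robust(losses) > threshold):
    learning_rate /= 10
</pre>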
<br />
It should be emphasized that this method isn't substantively different from what a whole lot of people already do when training deep neural networks. The only difference here is that the "look at the loss and see if it's decreasing" step is being done by a computer. The point of this blog post is to point out that this check is trivially automatable with boring old simple statistics. There is no reason to do it by hand. Let the computer do it and find something more productive to do with your time than babysitting SGD. The test is simple to implement yourself, but if you want to just call a function you can call dlib's <a href="http://dlib.net/algorithms.html#count_steps_without_decrease">count_steps_without_decrease</a>() and <a href="http://dlib.net/algorithms.html#count_steps_without_decrease_robust">count_steps_without_decrease_robust</a>() routines from C++ or Python.<br />
<br />
Finally, one more useful thing you can do is the following: you can periodically check if $P(\text{slope}(Y) > 0) > 0.99$, that is, check if we are really certain that the loss is going up, rather than down. This can happen and I've had training runs that were going fine and then suddenly the loss shot up and stayed high for a really long time, basically ruining the training run. This doesn't seem to be too much of an issue with simple losses like the log-loss. However, structured loss functions that perform some kind of hard negative mining inside a mini-batch will sometimes go haywire if they hit a very bad mini-batch. You can fix this problem by simply reloading from an earlier network state before the loss increased. But to do this you need a reliable way to measure "the loss is going up" and $P(\text{slope}(Y) > 0) > 0.99$ is excellent for this task. This idea is called backtracking and has a long history in numerical optimization. Backtracking significantly increases solver robustness in many cases and is well worth using.<br />
<br />
<b>Correctly Mirroring Datasets</b> (2018-01-14)<br />
<br />
I get asked a lot of questions about <a href="http://blog.dlib.net/2014/08/real-time-face-pose-estimation.html">dlib's landmarking tools</a>. Some of the most common questions are about how to prepare a good training dataset. One of the most useful tricks for creating a dataset is to mirror the data, since this effectively doubles the amount of training data. However, if you do this naively you end up with a terrible training dataset that produces really awful landmarking models. Some of the most common questions I get are about why this is happening.<br />
<br />
To understand the issue, consider the following image of an annotated face from the <a href="https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/">iBUG 300-W dataset</a>:<br />
<br />
<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd-DDJj0G4nM_1nUYlI_9LBJxso8t6TsjZYprfodAEHZU141BAhzOReCj7_tJAPxv9-VJdldH096blmxtt6xGAOsRJO5_r4YyXfI3MzE6GtF7OO8j1Vjikkt8fvdJ_vz1Kx2aAvHC5WAI/s1600/Screenshot+from+2018-01-14+09-43-26.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd-DDJj0G4nM_1nUYlI_9LBJxso8t6TsjZYprfodAEHZU141BAhzOReCj7_tJAPxv9-VJdldH096blmxtt6xGAOsRJO5_r4YyXfI3MzE6GtF7OO8j1Vjikkt8fvdJ_vz1Kx2aAvHC5WAI/s1600/Screenshot+from+2018-01-14+09-43-26.png" /></a></div>
<div>
<br /></div>
<div>
Since the mirror image of a face is still a face, we can mirror images like this to get more training data. However, what happens if you simply mirror the annotations? You end up with the wrong annotation labels! To see this, take a look at the figure below. The left image shows what happens if you naively mirror the above image and its landmarks. Note, for instance, that the points along the jawline are now annotated in reverse order. In fact, nearly all the annotations in the left image are wrong. Instead, <b>you want to match the source image's labeling scheme</b>. A mirrored image with the correct annotations is shown on the right.</div>
<div>
<br /></div>
<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYOC89K9xj3SD6g2excgDUzVSSLLp32SGtbjxaA_IhqM5BH37xzZHnDK2SzvjkTrWxe1dfu92oWsrRFR3n-OJOI-HT6gZDgDNUctW49e9B8NdQjX3qx_f2OhooUUa-vP5T6qpblIEQuLA/s1600/simple_mirroring_vs_source_label_matching.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYOC89K9xj3SD6g2excgDUzVSSLLp32SGtbjxaA_IhqM5BH37xzZHnDK2SzvjkTrWxe1dfu92oWsrRFR3n-OJOI-HT6gZDgDNUctW49e9B8NdQjX3qx_f2OhooUUa-vP5T6qpblIEQuLA/s1600/simple_mirroring_vs_source_label_matching.png" /></a></div>
<div>
<br /></div>
<div>
Dlib's <a href="https://github.com/davisking/dlib/tree/master/tools/imglab">imglab</a> tool has had a --flip option for a long time that would mirror a dataset for you. However, it used naive mirroring and it was left up to the user to adjust any landmark labels appropriately. Many users found this confusing, so in the new version of imglab (v1.13) the --flip command now performs automatic source label matching using a <a href="https://en.wikipedia.org/wiki/Point_set_registration">2D point registration algorithm</a>. That is, it left-right flips the dataset and annotations. Then it registers the mirrored landmarks with the original landmarks and transfers labels appropriately. In fact, the "source label matching" image on the right was created by the new version of imglab.<br />
<br />
Finally, just to be clear, the point registration algorithm will work on anything. It doesn't have to be iBug's annotations. It doesn't have to be faces. It's a general point registration method that will work correctly for any kind of landmark annotated data with left-right symmetry. However, if you want the old --flip behavior you can use the new --flip-basic to get a naive mirroring. But most users will want to use the new --flip.</div>
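<div>
<br /></div>
<div>
To make the idea concrete, here is a simplified Python sketch of the two behaviors. It is an illustration, not imglab's actual implementation: imglab solves the label assignment jointly with a proper 2D point registration, while this sketch just removes the horizontal shift caused by the flip and matches each mirrored point to the nearest original point.</div>
<pre>def mirror_naive(points, image_width):
    # Flip the geometry but keep the original label order, which is wrong
    # for any labeling scheme with left/right structure.
    return [(image_width - 1 - x, y) for (x, y) in points]

def mirror_with_source_label_matching(points, image_width):
    # Flip the geometry, then transfer labels: after centering, each
    # mirrored point inherits the label of the nearest original point
    # (the mirrored left eye lines up with the original right eye).
    mirrored = mirror_naive(points, image_width)
    cx_m = sum(x for x, _ in mirrored) / len(mirrored)
    cx_o = sum(x for x, _ in points) / len(points)
    relabeled = [None] * len(points)
    for (x, y) in mirrored:
        j = min(range(len(points)),
                key=lambda k: (points[k][0] - cx_o - (x - cx_m)) ** 2
                            + (points[k][1] - y) ** 2)
        relabeled[j] = (x, y)
    return relabeled
</pre>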
<br />
<b>A Global Optimization Algorithm Worth Using</b> (2017-12-28)<br />
<br />
Here is a common problem: you have some machine learning algorithm you want to use but it has these damn <i>hyperparameters</i>. These are numbers like weight decay magnitude, Gaussian kernel width, and so forth. The algorithm doesn't set them; instead, it's up to you to determine their values. If you don't set these parameters to "good" values the algorithm doesn't work. So what do you do? Well, here is a list of everything I've seen people do, listed in order of most to least common:<br />
<ul>
<li><b>Guess and Check</b>: Listen to your gut, pick numbers that feel good and see if they work. Keep doing this until you are tired of doing it.</li>
<li><b>Grid Search</b>: Ask your computer to try a bunch of values spread evenly over some range.</li>
<li><b>Random Search</b>: Ask your computer to try a bunch of values by picking them randomly. </li>
<li><b>Bayesian Optimization</b>: Use a tool like MATLAB's <a href="https://www.mathworks.com/help/stats/bayesopt.html">bayesopt</a> to automatically pick the best parameters, then find out Bayesian Optimization has more hyperparameters than your machine learning algorithm, get frustrated, and go back to using guess and check or grid search.</li>
<li><b>Local Optimization With a Good Initial Guess</b>: This is what <a href="https://github.com/mit-nlp/MITIE">MITIE</a> does: it uses the <a href="http://dlib.net/optimization.html#find_min_bobyqa">BOBYQA</a> algorithm with a well-chosen starting point. Since BOBYQA only finds the nearest <i>local</i> optimum, the success of this method is heavily dependent on a good starting point. In MITIE's case we know a good starting point, but this isn't a general solution since usually you won't know a good starting point. On the plus side, this kind of method is extremely good at finding a local optimum. I'll have more to say on this later.</li>
</ul>
<div>
The vast majority of people just do guess and check. That sucks and there should be something better. We all want some black-box optimization strategy like Bayesian optimization to be useful, but in my experience, if you don't set its hyperparameters to the right values it doesn't work as well as an expert doing guess and check. Everyone I know who has used Bayesian optimization has had the same experience. Ultimately, if I think I can do better hyperparameter selection manually then that's what I'm going to do, and most of my colleagues feel the same way. The end result is that I don't use automated hyperparameter selection tools most of the time, and that bums me out. I badly want a parameter-free global optimizer that I can trust to do hyperparameter selection.<br />
<br />
So I was very excited when I encountered the paper <i><a href="https://arxiv.org/abs/1703.02628">Global optimization of Lipschitz functions</a></i> by Cédric Malherbe and Nicolas Vayatis, presented at this year's International Conference on Machine Learning (ICML). In this paper, they propose a very simple <i>parameter-free and provably correct method</i> for finding the $x \in \mathbb{R}^d$ that maximizes a function, $f(x)$, even if $f(x)$ has many local maxima. The key idea in their paper is to maintain a piecewise linear upper bound of $f(x)$ and use that to decide which $x$ to evaluate at each step of the optimization. So if you already evaluated the points $x_1, x_2, \cdots, x_t$ then you can define a simple upper bound on $f(x)$ like this:<br />
\[ \newcommand{\norm}[1]{\left\lVert#1\right\rVert} U(x) = \min_{i=1\dots t} (f(x_i) + k \cdot \norm{x-x_i}_2 ) \] Where $k$ is the Lipschitz constant for $f(x)$. Therefore, it is trivially true that $U(x) \geq f(x), \forall x$, by the definition of the Lipschitz constant. The authors go on to suggest a simple algorithm, called LIPO, that picks points at random, checks if the upper bound for the new point is better than the best point seen so far, and if so selects it as the next point to evaluate. For example, the figure below shows a plot of a simple $f(x)$ in red with a plot of its associated upper bound $U(x)$ in green. In this case $U(x)$ is defined by 4 points, indicated here with little black squares.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbxnvbpzx69SDM0zVbgtxzL1cZ4BHxJTu1wFjwsArfCWx4_oWK1CkoNUdPfvJVt89hE3NRwAbDaOsBcND0tip6u9EMqP6ptnp0YAvusQKrb7PvwhyphenhyphenSkoEKcWt3sAUoBNcX0ZfSqcOfs3o/s1600/g4175.png" style="all: initial;" /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
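In code, the bound and the basic LIPO acceptance rule are only a few lines. Here is a Python sketch (with $k$ assumed known for the moment; samples is a non-empty list of $(x_i, f(x_i))$ pairs and f takes a point given as a list of coordinates):<br />
<pre>import math, random

def upper_bound(x, samples, k):
    # U(x) = min over i of f(x_i) + k*||x - x_i||
    # (math.dist requires Python 3.8+)
    return min(fx + k * math.dist(x, xi) for xi, fx in samples)

def lipo_step(f, bounds, samples, k):
    # Draw random candidates and evaluate f at the first one whose upper
    # bound exceeds the best value seen so far.
    best = max(fx for _, fx in samples)
    while True:
        x = [random.uniform(lo, hi) for lo, hi in bounds]
        if upper_bound(x, samples, k) > best:
            samples.append((x, f(x)))
            return x
</pre>
<br />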
It shouldn't take a lot of imagination to see how the upper bound helps you pick good points to evaluate. For instance, if you selected the max upper bound as the next iterate you would already get pretty close to the global maximizer. The authors go on to prove a bunch of nice properties of this method. In particular, they both prove mathematically and show empirically that the method is better than random search in a number of non-trivial situations. This is a fairly strong statement considering how well <a href="http://www.jmlr.org/papers/v13/bergstra12a.html">random hyperparameter search</a> performs relative to competing hyperparameter optimization methods. They also compare the method to other algorithms like Bayesian optimization and show that it's competitive.<br />
<br />
But you are probably thinking: "Hold on a second, we don't know the value of the Lipschitz constant $k$!". This isn't a big deal since it's easily estimated, for instance, by setting $k$ to the largest observed slope of $f(x)$ before each iteration. That's equivalent to solving the following easy problem:<br />
\begin{align}<br />
\min_{k} & \quad k^2 \\<br />
\text{s.t.} & \quad U(x_i) \geq f(x_i), \quad \forall i \in [1\dots t] \\<br />
& \quad k \geq 0<br />
\end{align} Malherbe et al. test a variant of this $k$ estimation approach and show it works well. <br />
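In code, that estimate is just the largest slope observed between any pair of evaluated points (a sketch, continuing with the samples list from above):<br />
<pre>import math

def estimate_lipschitz_k(samples):
    # The smallest k consistent with U(x_i) >= f(x_i) for every sample:
    # the largest slope observed between any two evaluated points.
    k = 0.0
    for i, (xi, fi) in enumerate(samples):
        for xj, fj in samples[:i]:
            d = math.dist(xi, xj)
            if d > 0:
                k = max(k, abs(fi - fj) / d)
    return k
</pre>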
<br />
This is great. I love this paper. It's proposing a global optimization method called LIPO that is both parameter free and provably better than random search. It's also really simple. Reading this paper gives you one of those "duah" moments where you wonder why you didn't think of this a long time ago. That's the mark of a great paper. So obviously I was going to add some kind of LIPO algorithm to dlib, which I did in the recent <a href="http://dlib.net/release_notes.html">dlib v19.8 release</a>.<br />
<br />
However, if you want to use LIPO in practice there are some issues that need to be addressed. The rest of this blog post discusses these issues and how the dlib implementation addresses them. First, if $f(x)$ is even a little noisy or discontinuous, LIPO is not going to work reliably since $k$ will be infinite. This happens in real world situations all the time. For instance, evaluating a binary classifier against the 0-1 loss gives you an objective function with little discontinuities anywhere a sample switches its predicted class. You could cross your fingers and run LIPO anyway, but you run the very real risk of two $x$ samples closely straddling a discontinuity and causing the estimated $k$ to explode. Second, not all hyperparameters are equally important: some hardly matter, while small changes in others drastically affect the output of $f(x)$. So it would be nice if each hyperparameter got its own $k$. You can address these problems by defining the upper bound $U(x)$ as follows:<br />
\[ U(x) = \min_{i=1\dots t} \left[ f(x_i) + \sqrt{\sigma_i +(x-x_i)^\intercal K (x-x_i)} \ \right] \] Now each sample from $f(x)$ has its own noise term, $\sigma_i$, which should be 0 most of the time unless $x_i$ is really close to a discontinuity or there is some stochasticity. Here, $K$ is a diagonal matrix that contains our "per hyperparameter Lipschitz $k$ terms". With this formulation, setting each $\sigma$ to 0 and $K=k^2I$ gives the same $U(x)$ as suggested by Malherbe et al., but if we let them take more general values we can deal with the above mentioned problems.<br />
<br />
Just like before, we can find the parameters of $U(x)$ by solving an optimization problem:<br />
\begin{align}<br />
\min_{K,\sigma} & \quad \norm{K}^2_F + 10^6 \sum_{i=1}^t {\sigma_i^2} &\\<br />
\text{s.t.} & \quad U(x_i) \geq f(x_i), & \quad \forall i \in [1\dots t] \\<br />
& \quad \sigma_i \geq 0 & \quad \forall i \in [1\dots t] \\<br />
& \quad K_{i,j} \geq 0 & \quad \forall i,j \in [1\dots d] \\<br />
& \quad \text{K is a diagonal matrix}<br />
\end{align} The $10^6$ penalty on $\sigma^2$ causes most $\sigma$ terms to be exactly 0. The behavior of the whole algorithm is insensitive to the particular penalty value used here, so long as it's reasonably large the $\sigma$ values will be 0 most of the time while still preventing $k$ from becoming infinite, which is the behavior we want. It's also possible to rewrite this as a big quadratic programming problem and solve it with a dual coordinate descent method. I'm not going into the details here. It's all in the dlib code for those really interested. The TL;DR is that it turns out to be easy to solve using well known methods and it fixes the infinite $k$ problem.<br />
<br />
The final issue that needs to be addressed is LIPO's terrible convergence in the area of a local maximizer. So while it's true that LIPO is great at getting onto the tallest peak of $f(x)$, once you are there it does not make very rapid progress towards the optimal location (i.e. the very top of the peak). This is a problem shared by many <a href="https://en.wikipedia.org/wiki/Derivative-free_optimization">derivative free optimization</a> algorithms, including MATLAB's Bayesian optimization tool. Fortunately, not all methods have this limitation. In particular, the late and great <a href="https://en.wikipedia.org/wiki/Michael_J._D._Powell">Michael J. D. Powell</a> wrote a series of papers on how to apply classic trust region methods to derivative free optimization. These methods fit a quadratic surface around the best point seen so far and then take the next iterate to be the maximizer of that quadratic surface within some distance of the current best point. So we "trust" this local quadratic model to be accurate within some small region around the best point, hence the name "trust region". The BOBYQA method I mentioned above is one of these methods and it has excellent convergence to the nearest local optimum, easily finding it to full floating point precision in a very small number of steps.<br />
<br />
We can fix LIPO's convergence problem by combining these two methods: LIPO will explore $f(x)$ and quickly find a point on the biggest peak. Then a Powell-style trust region method can efficiently find the exact maximizer of that peak. The simplest way to combine these two things is to alternate between them, which is what dlib does. On even iterations we pick the next $x$ according to our upper bound while on odd iterations we pick the next $x$ according to the trust region model. I've also used a slightly different version of LIPO that I'm calling MaxLIPO. Recall that Malherbe et al. suggest selecting any point with an upper bound larger than the current best objective value. However, I've found that selecting the maximum upper bounding point on each iteration is slightly better. This alternative version, MaxLIPO, is therefore what dlib uses. You can see this hybrid of MaxLIPO and a trust region method in action in the following video:<br />
<br />
<center>
<video controls="true" poster="http://dlib.net/find_max_global_example.png" style="border: black dotted 1px;" width="80%">
<source src="http://dlib.net/find_max_global_example.webm" type="video/webm"></source>
<source src="http://dlib.net/find_max_global_example.mp4" type="video/mp4"></source>
Video of optimizer running
</video></center>
<br />
In the video, the red line is the function to be optimized and we are looking for the maximum point. Every time the algorithm samples a point from the function we note it with a little box. The state of the solver is determined by the global upper bound $U(x)$ and the local quadratic model used by the trust region method. Therefore, we draw the upper bounding model as well as the current local quadratic model so you can see how they evolve as the optimization proceeds. We also note the location of the best point seen so far by a little vertical line.<br />
<br />
You can see that the optimizer is alternating between picking the maximum upper bounding point and the maximum point according to the quadratic model. As the optimization proceeds, the upper bound becomes progressively more accurate, helping to find the best peak to investigate, while the quadratic model quickly finds a high precision maximizer on whatever peak it currently rests. These two things together allow the optimizer to find the true global maximizer to high precision (within $\pm{10^{-9}}$ in this case) by the time the video concludes.</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiflNDIB8ATDOxLqjR308fmEJiX2w1fBB5wNXPKjAEfcLBVfPSzAqk2DCNHSMPLDxj4ECpmTpGB-NWtL7VmLI7uiYOegCT4_xyl5ppzA5pAfeZtiGyU1w8bhDIevnQ8cbQ2mAVyVlm5esA/s1600/page1-800px-Holder_table_function.pdf.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="600" data-original-width="800" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiflNDIB8ATDOxLqjR308fmEJiX2w1fBB5wNXPKjAEfcLBVfPSzAqk2DCNHSMPLDxj4ECpmTpGB-NWtL7VmLI7uiYOegCT4_xyl5ppzA5pAfeZtiGyU1w8bhDIevnQ8cbQ2mAVyVlm5esA/s400/page1-800px-Holder_table_function.pdf.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-size: small;">The Holder Table Test Function</span></b><br />
<span style="font-size: xx-small;">from https://en.wikipedia.org/wiki/File:Holder_table_function.pdf</span></td></tr>
</tbody></table>
<br />
Now let's do an experiment to see how this hybrid of MaxLIPO and Powell's trust region method (TR) compares to MATLAB's Bayesian optimization tool with its default settings. I ran both algorithms on the Holder table test function 100 times and plotted the average error with one standard deviation error bars. So the plot below shows $f(x^\star)-f(x_i)$, the difference between the true global optimum and the best solution found so far, as a function of the number of calls to $f(x)$. You can see that MATLAB's BayesOpt stalls out at an accuracy of about $\pm{10^{-3}}$ while our hybrid method (MaxLIPO+TR, the new method in dlib) quickly approaches full floating point precision of around $\pm{10^{-17}}$.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHrAke1weMeC2GlVb6fN9ktraWJPuOV4N3l0mOqsuaio7uqUVHT6oqi2bJHWpJtGbwvwzyRe7QBFSAl4WyjgYBOBuUgu8sTnpIub6RL3QeISKr1R6AfxOPn0YUEBvCI8DTJTBC7MiD7cE/s1600/bayesopt_vs_lipo.svg.png" style="all: initial; all: unset;" /> </div>
<br />
I also reran some of the tests from Figure 5 of the LIPO paper. The results are shown in the table below. In these experiments I compared the performance of LIPO with and without the trust region solver (LIPO+TR and LIPO). Additionally, to verify that LIPO is better than pure random search I tested a version of the algorithm that alternates between pure random search and the trust region solver (PRS+TR) rather than alternating between a LIPO method and a trust region solver (LIPO+TR and MaxLIPO+TR). Pure random search (PRS) is also included for reference. Finally, the new algorithm implemented in dlib, MaxLIPO+TR, is included as well. In each test I ran the algorithm 1000 times and recorded the mean and standard deviation of the number of calls to $f(x)$ required to reach a particular solution accuracy. For instance, $\epsilon=0.01$ means that $f(x^\star)-f(x_i) \leq 0.01$, while "target 99%" uses the "target" metric from Malherbe's paper, which for most tests corresponds to an $\epsilon > 0.1$. Tests that took too long to execute are noted with a - symbol.<br />
<br />
The key points to notice about these results are that the addition of a trust region method allows LIPO to reach much higher solution accuracy. It also makes the algorithm run faster. Recall that LIPO works internally by using random search of $U(x)$. Therefore, the number of calls LIPO makes to $U(x)$ is at least as many as PRS would require when searching $f(x)$. So for smaller $\epsilon$ it becomes very expensive to execute LIPO. For instance, I wasn't able to get results for LIPO, by itself, at accuracies better than $0.1$ on any of the test problems since it took too long to execute. However, with a trust region method the combined algorithm can easily achieve high precision solutions. The other significant detail is that, for tests with many local optima, all methods combining LIPO with TR are much better than PRS+TR. This is most striking on ComplexHolder, which is a version of the HolderTable test function with additional high frequency sinusoidal noise that significantly increases the number of local optima. On ComplexHolder, LIPO based methods require about an order of magnitude fewer calls to $f(x)$ than PRS+TR, further justifying the claims by Malherbe et al. of the superiority of LIPO relative to pure random search.<br />
<br />
<center>
<img src="http://dlib.net/find_max_global_results_table.svg" width="100%" /></center>
<br />
The new method in dlib, MaxLIPO+TR, fares the best in all my tests. What is remarkable about this method is its simplicity. In particular, <b>MaxLIPO+TR doesn't have any hyperparameters, making it very easy to use.</b> I've been using it for a while now for hyperparameter optimization and have been very pleased. It's the first black-box hyperparameter optimization algorithm I've had enough confidence in to use on real problems.<br />
<br />
Finally, here is an example of how you can use this new optimizer from Python:<br />
<pre style="background-color: white; line-height: 16.25px;"><span class="k" style="color: blue;">def</span> <span class="nf">holder_table</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span><span class="n">x1</span><span class="p">):</span>
<span class="k" style="color: blue;">return</span> <span class="o">-</span><span class="nb">abs</span><span class="p">(</span><span class="n">sin</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="o">*</span><span class="n">cos</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="o">*</span><span class="n">exp</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sqrt</span><span class="p">(</span><span class="n">x0</span><span class="o">*</span><span class="n">x0</span><span class="o">+</span><span class="n">x1</span><span class="o">*</span><span class="n">x1</span><span class="p">)</span><span class="o">/</span><span class="n">pi</span><span class="p">)))</span>
<span class="n">x</span><span class="p">,</span><span class="n">y</span> <span class="o">=</span> <span class="n">dlib</span><span class="o">.</span><span class="n">find_min_global</span><span class="p">(</span><span class="n">holder_table</span><span class="p">,</span>
<span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span><span class="o">-</span><span class="mi">10</span><span class="p">],</span> <span class="c1" style="color: green;"># Lower bound constraints on x0 and x1 respectively</span>
<span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">],</span> <span class="c1" style="color: green;"># Upper bound constraints on x0 and x1 respectively</span>
<span class="mi">80</span><span class="p">)</span> <span class="c1" style="color: green;"># The number of times find_min_global() will call holder_table()</span></pre>
<pre style="background-color: white; line-height: 16.25px;"><span class="c1" style="color: green;">
</span></pre>
<pre style="background-color: white; line-height: 16.25px;">Or in C++11:</pre>
<pre><span style="color: blue;">auto</span> holder_table <span style="color: #5555ff;">=</span> []<span style="font-family: "lucida console";">(</span><span style="color: blue;"><u>double</u></span> x0, <span style="color: blue;"><u>double</u></span> x1<span style="font-family: "lucida console";">)</span> <b>{</b><span style="color: blue;">return</span> <span style="color: #5555ff;">-</span><span style="color: #bb00bb;">abs</span><span style="font-family: "lucida console";">(</span><span style="color: #bb00bb;">sin</span><span style="font-family: "lucida console";">(</span>x0<span style="font-family: "lucida console";">)</span><span style="color: #5555ff;">*</span><span style="color: #bb00bb;">cos</span><span style="font-family: "lucida console";">(</span>x1<span style="font-family: "lucida console";">)</span><span style="color: #5555ff;">*</span><span style="color: #bb00bb;">exp</span><span style="font-family: "lucida console";">(</span><span style="color: #bb00bb;">abs</span><span style="font-family: "lucida console";">(</span><span style="color: #979000;">1</span><span style="color: #5555ff;">-</span><span style="color: #bb00bb;">sqrt</span><span style="font-family: "lucida console";">(</span>x0<span style="color: #5555ff;">*</span>x0<span style="color: #5555ff;">+</span>x1<span style="color: #5555ff;">*</span>x1<span style="font-family: "lucida console";">)</span><span style="color: #5555ff;">/</span>pi<span style="font-family: "lucida console";">)</span><span style="font-family: "lucida console";">)</span><span style="font-family: "lucida console";">)</span>;<b>}</b>;
<span style="color: #009900;">// obtain result.x and result.y
</span><span style="color: blue;">auto</span> result <span style="color: #5555ff;">=</span> <span style="color: #bb00bb;">find_min_global</span><span style="font-family: "lucida console";">(</span>holder_table,
<b>{</b><span style="color: #5555ff;">-</span><span style="color: #979000;">10</span>,<span style="color: #5555ff;">-</span><span style="color: #979000;">10</span><b>}</b>, <span style="color: #009900;">// lower bounds
</span> <b>{</b><span style="color: #979000;">10</span>,<span style="color: #979000;">10</span><b>}</b>, <span style="color: #009900;">// upper bounds
</span> <span style="color: #bb00bb;">max_function_calls</span><span style="font-family: "lucida console";">(</span><span style="color: #979000;">80</span><span style="font-family: "lucida console";">)</span><span style="font-family: "lucida console";">)</span>;</pre>
<br />
Both of these methods find holder_table's global optimum to about 12 digits of precision in about 0.1 seconds. The C++ API exposes a wide range of ways to call the solver, including optimizing multiple functions at a time and adding integer constraints. <a href="http://dlib.net/optimization.html#find_max_global">See the documentation for full details</a>.<br />
<br />
<b>Dlib 19.8 is Out</b> (2017-12-20)<br />
<br />
Dlib 19.8 is officially out. There are <a href="http://dlib.net/release_notes.html">a lot of changes</a>, but the two most interesting ones are probably the <a href="http://dlib.net/global_optimization.py.html">new global optimizer</a> and <a href="http://dlib.net/dnn_semantic_segmentation_ex.cpp.html">semantic segmentation</a> examples. The global optimizer is definitely my favorite as it allows you to easily find the optimal hyperparameters for machine learning algorithms. It also has a very convenient syntax. For example, consider the Holder table test function:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><img alt="File:Holder table function.pdf" height="300" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/Holder_table_function.pdf/page1-800px-Holder_table_function.pdf.jpg" style="margin-left: auto; margin-right: auto;" width="400" /></td></tr>
<tr><td class="tr-caption" style="text-align: center;">From https://en.wikipedia.org/wiki/File:Holder_table_function.pdf</td></tr>
</tbody></table>
Here is how you could use dlib's new optimizer from Python to optimize the difficult Holder table function:<br />
<pre style="background-color: white; line-height: 16.25px;"><span class="k" style="color: blue;">def</span> <span class="nf">holder_table</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span><span class="n">x1</span><span class="p">):</span>
<span class="k" style="color: blue;">return</span> <span class="o">-</span><span class="nb">abs</span><span class="p">(</span><span class="n">sin</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="o">*</span><span class="n">cos</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="o">*</span><span class="n">exp</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sqrt</span><span class="p">(</span><span class="n">x0</span><span class="o">*</span><span class="n">x0</span><span class="o">+</span><span class="n">x1</span><span class="o">*</span><span class="n">x1</span><span class="p">)</span><span class="o">/</span><span class="n">pi</span><span class="p">)))</span>
<span class="n">x</span><span class="p">,</span><span class="n">y</span> <span class="o">=</span> <span class="n">dlib</span><span class="o">.</span><span class="n">find_min_global</span><span class="p">(</span><span class="n">holder_table</span><span class="p">,</span>
<span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span><span class="o">-</span><span class="mi">10</span><span class="p">],</span> <span class="c1" style="color: green;"># Lower bound constraints on x0 and x1 respectively</span>
<span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">],</span> <span class="c1" style="color: green;"># Upper bound constraints on x0 and x1 respectively</span>
<span class="mi">80</span><span class="p">)</span> <span class="c1" style="color: green;"># The number of times find_min_global() will call holder_table()</span></pre>
<pre style="background-color: white; line-height: 16.25px;"><span class="c1" style="color: green;">
</span></pre>
<pre style="background-color: white; line-height: 16.25px;">Or in C++: </pre>
<pre><span style="color: blue;">auto</span> holder_table <span style="color: #5555ff;">=</span> []<span style="font-family: "lucida console";">(</span><span style="color: blue;"><u>double</u></span> x0, <span style="color: blue;"><u>double</u></span> x1<span style="font-family: "lucida console";">)</span> <b>{</b><span style="color: blue;">return</span> <span style="color: #5555ff;">-</span><span style="color: #bb00bb;">abs</span><span style="font-family: "lucida console";">(</span><span style="color: #bb00bb;">sin</span><span style="font-family: "lucida console";">(</span>x0<span style="font-family: "lucida console";">)</span><span style="color: #5555ff;">*</span><span style="color: #bb00bb;">cos</span><span style="font-family: "lucida console";">(</span>x1<span style="font-family: "lucida console";">)</span><span style="color: #5555ff;">*</span><span style="color: #bb00bb;">exp</span><span style="font-family: "lucida console";">(</span><span style="color: #bb00bb;">abs</span><span style="font-family: "lucida console";">(</span><span style="color: #979000;">1</span><span style="color: #5555ff;">-</span><span style="color: #bb00bb;">sqrt</span><span style="font-family: "lucida console";">(</span>x0<span style="color: #5555ff;">*</span>x0<span style="color: #5555ff;">+</span>x1<span style="color: #5555ff;">*</span>x1<span style="font-family: "lucida console";">)</span><span style="color: #5555ff;">/</span>pi<span style="font-family: "lucida console";">)</span><span style="font-family: "lucida console";">)</span><span style="font-family: "lucida console";">)</span>;<b>}</b>;
<span style="color: #009900;">// obtain result.x and result.y
</span><span style="color: blue;">auto</span> result <span style="color: #5555ff;">=</span> <span style="color: #bb00bb;">find_min_global</span><span style="font-family: "lucida console";">(</span>holder_table,
<b>{</b><span style="color: #5555ff;">-</span><span style="color: #979000;">10</span>,<span style="color: #5555ff;">-</span><span style="color: #979000;">10</span><b>}</b>, <span style="color: #009900;">// lower bounds
</span> <b>{</b><span style="color: #979000;">10</span>,<span style="color: #979000;">10</span><b>}</b>, <span style="color: #009900;">// upper bounds
</span> <span style="color: #bb00bb;">max_function_calls</span><span style="font-family: "lucida console";">(</span><span style="color: #979000;">80</span><span style="font-family: "lucida console";">)</span><span style="font-family: "lucida console";">)</span>;</pre>
<br />
Both of these methods find holder_table's global optimum to about 12 digits of precision in about 0.1 seconds. The <a href="http://dlib.net/optimization.html#global_function_search">documentation has much more to say about this new tooling</a>. I'll also make a blog post soon that goes into much more detail on how the method works.<br />
<br />
Finally, here are some fun example outputs from the new semantic segmentation example program:<br />
<img alt="image" src="https://user-images.githubusercontent.com/2297572/32693258-936d34a8-c730-11e7-86fb-b83037fd4bf0.png" style="max-width: 100%;" /><br />
<br />
<img alt="image" src="https://user-images.githubusercontent.com/2297572/32693250-514e6e52-c730-11e7-9187-69f6dc7078f5.png" style="max-width: 100%;" /><br />
<br />
<img alt="image" src="https://user-images.githubusercontent.com/2297572/32693256-8003b61c-c730-11e7-80aa-7cdcd9de4840.png" style="max-width: 100%;" /><br />
<br />
<img alt="image" src="https://user-images.githubusercontent.com/2297572/32693381-e23262d2-c732-11e7-89b4-3b9eacccba4b.png" style="max-width: 100%;" />Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com12tag:blogger.com,1999:blog-6061887630060661987.post-49739638820054161712017-09-23T10:58:00.000-04:002017-09-23T10:58:50.772-04:00Fast Multiclass Object Detection in Dlib 19.7The <a href="http://dlib.net/release_notes.html">new version of dlib</a> is out and the biggest new feature is the ability to train multiclass object detectors with dlib's convolutional neural network tooling. The previous version only allowed you to train single class detectors, but this release adds the option to create single CNN models that output multiple labels. As an example, I created a small 894 image dataset where I annotated the fronts and rears of cars and used it to train a 2-class detector. You can see the resulting detector running in this video:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe width="320" height="266" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/OHbJ7HhbG74/0.jpg" src="https://www.youtube.com/embed/OHbJ7HhbG74?feature=player_embedded" frameborder="0" allowfullscreen></iframe></div>
<br />
If you want to run the car detector from this video on your own images you can check out <a href="http://dlib.net/dnn_mmod_find_cars2_ex.cpp.html">this example program</a>. <br />
<br />
I've also improved the detector speed in dlib 19.7 by pushing more of the processing to the GPU. This makes the detector 2.5x faster. For example, on the 928x478 image used in <a href="http://dlib.net/dnn_mmod_find_cars_ex.cpp.html">this example program</a>, the detector ran at 39fps in the previous version of dlib but now runs at 98fps (when run on an NVIDIA 1080ti).<br />
<br />
This release also includes a new 5-point face landmarking model that finds the corners of the eyes and bottom of nose:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIWBmhsNWt1CJB6NTzil5ZKJcTWOg-unrU2hMtI2sBs7obsHxeoch23AZvrECX7Vd9om30WewxZjICokGHoDMEpS6NuYPQXcMR_pFPMnZpxoDpycfY84dGz0vgzBv-1zmFMQjGDpbnpok/s1600/therock.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="491" data-original-width="459" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIWBmhsNWt1CJB6NTzil5ZKJcTWOg-unrU2hMtI2sBs7obsHxeoch23AZvrECX7Vd9om30WewxZjICokGHoDMEpS6NuYPQXcMR_pFPMnZpxoDpycfY84dGz0vgzBv-1zmFMQjGDpbnpok/s400/therock.jpg" width="373" /></a></div>
<br />
Unlike the <a href="http://blog.dlib.net/2014/08/real-time-face-pose-estimation.html">68-point landmarking model included with dlib</a>, this model is over 10x smaller at 8.8MB compared to the 68-point model's 96MB. It also runs faster, and even more importantly, works with the <a href="http://blog.dlib.net/2016/10/easily-create-high-quality-object.html">state-of-the-art CNN face detector in dlib</a> as well as the older HOG face detector in dlib. The central use-case of the 5-point model is to perform 2D face alignment for applications like face recognition. In any of the dlib code that does face alignment, the new 5-point model is a drop-in replacement for the 68-point model and in fact is the new recommended model to use with <a href="http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html">dlib's face recognition tooling</a>. <br />
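For reference, here is a minimal Python sketch of running the new model (the model filename is the one distributed on dlib.net; the image path is a placeholder):<br />
<pre>import dlib
from skimage import io

detector = dlib.get_frontal_face_detector()
sp = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")

img = io.imread("face.jpg")
for det in detector(img, 1):   # upsample once, as in dlib's examples
    shape = sp(img, det)       # 5 points: eye corners and bottom of nose
    points = [(shape.part(i).x, shape.part(i).y)
              for i in range(shape.num_parts)]
</pre>
<br />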
<br />
<b>Vehicle Detection with Dlib 19.5</b> (2017-08-27)<br />
Dlib v19.5 is out and there are a <a href="http://dlib.net/release_notes.html">lot of new features</a>. There is a dlib-to-Caffe converter, a bunch of new deep learning layer types, cuDNN v6 and v7 support, and a bunch of optimizations that make things run faster in different situations, like ARM NEON support, which makes <a href="http://blog.dlib.net/2014/02/dlib-186-released-make-your-own-object.html">HOG-based detectors</a> run a lot faster on mobile devices.<br />
<br />
However, the coolest and most requested feature has been an upgrade to the <a href="http://blog.dlib.net/2016/10/easily-create-high-quality-object.html">CNN+MMOD object detector</a> to support detecting things with varying aspect ratios. The previous version of the detector required the training data to consist of objects that all had essentially the same aspect ratio. This is fine for tasks like face detection and <a href="http://blog.dlib.net/2016/10/hipsterize-your-dog-with-deep-learning.html">dog hipsterization</a>, but obviously not as general as you would like.<br />
<br />
So dlib v19.5 includes an updated version of the MMOD loss layer that can be used to learn an object detector from a dataset with any mixture of bounding box shapes and sizes. To demo this new feature, I used the new MMOD code to create a vehicle detector, which you can see running on these videos. This detector is trained to find cars moving with you in traffic, and therefore cars where the rear end of the vehicle is visible.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/4B3bzmxMAZU/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/4B3bzmxMAZU?feature=player_embedded" width="320"></iframe></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/bP2SUo5vSlc/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/bP2SUo5vSlc?feature=player_embedded" width="320"></iframe></div>
<br />
The detector is just as fast as previous versions of the CNN+MMOD detector. For instance, when I run it on my NVIDIA 1080ti I can process 39 frames per second when processing them individually and 93 frames per second when processing them grouped into batches. This assumes a frame size of 928x478.<br />
<br />
If you want to run this detector yourself you can check out the <a href="http://dlib.net/dnn_mmod_find_cars_ex.cpp.html">new example program</a> that does just that. The detector was trained on a modest dataset of 2217 images, which is also available, as is <a href="http://dlib.net/dnn_mmod_train_find_cars_ex.cpp.html">the training code</a>. Both these new example programs contain a lot of information about training this kind of detector and are worth reading if you want to understand the details involved. However, we can go into a short description here to understand how the detector works.<br />
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCxjb1yUHe8pgTqPeEHaKBpVdjja0BeVD5p1ICegi6-XxRwS-x7i1zOMOnNwwRACnDdLpqpLh-Qx-r29nNLmCnU7RDG-PyI-igR2kRqAoiL10GKxrvfBvpBKP9uEYP5-pwAueVNifuo2Y/s1600/mmod_tutorial_raw_image_with_detections.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="478" data-original-width="928" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCxjb1yUHe8pgTqPeEHaKBpVdjja0BeVD5p1ICegi6-XxRwS-x7i1zOMOnNwwRACnDdLpqpLh-Qx-r29nNLmCnU7RDG-PyI-igR2kRqAoiL10GKxrvfBvpBKP9uEYP5-pwAueVNifuo2Y/s640/mmod_tutorial_raw_image_with_detections.jpg" width="640" /></a></div>
<br />
Take this image as an example. I ran the new vehicle detector on it and plotted the resulting detections as red boxes. So what are the processing steps that go from the raw image to the 6 boxes? To roughly summarize, they are:<br />
<ol>
<li>Create an image pyramid and pack the pyramid into one big image. Let's call this the "tiled pyramid"</li>
<li>Run the tiled pyramid image through a CNN. The CNN outputs a new image where bright pixels in the output image indicate the presence of cars.</li>
<li>Find pixels in the CNN's output image with a value > 0. Those locations are your preliminary car detections.</li>
<li>Perform non-maximum suppression on the preliminary detections to produce the final output.</li>
</ol>
Steps 3 and 4 are pretty straightforward. It's the first two steps that are complicated. So to understand them, let's visualize the outputs of these first two steps. All step 1 does is call <a href="http://dlib.net/imaging.html#create_tiled_pyramid">dlib::create_tiled_pyramid</a> on the input image to produce this new image:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj56WQLrGDwXHczaFeVUWMvV6YmRUPuTo_ZVgqTOFdw1h9E024mALSIluFsucttkpId5gniFB0k8xy0aaUnBReLAl1oJQgMJqEf96nZS0AndyqKCGv-yf6zX8eiMTxpwXmXzvIA8kLivzs/s1600/mmod_tutorial_image_pyramid.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="858" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj56WQLrGDwXHczaFeVUWMvV6YmRUPuTo_ZVgqTOFdw1h9E024mALSIluFsucttkpId5gniFB0k8xy0aaUnBReLAl1oJQgMJqEf96nZS0AndyqKCGv-yf6zX8eiMTxpwXmXzvIA8kLivzs/s640/mmod_tutorial_image_pyramid.jpg" width="342" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
What's special about this image is that we don't need to worry about scale anymore. That is, suppose we have a detection algorithm that can find cars, but it only knows how to find cars of a certain size. No problem. When you run it on this tiled pyramid image you are going to find each car somewhere in it at the scale your detector expects. Moreover, the tiled pyramid is only about 3.7 times larger than the original image, so processing it instead of the raw image gives us full scale invariance for only a 3.7x increase in computational cost. That's a very reasonable trade. Moreover, tiling it inside a rectangular image makes it very easy to process using normal CNN tooling on a GPU and still get full GPU speeds. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Now for step 2. The CNN takes the tiled pyramid as input, does a bunch of convolutions, and outputs a new set of images. In the case of our vehicle detector, it outputs 3 new images, each of which is a detection strength map that gets "hot" in locations likely to contain a vehicle. The reason there are 3 images for the vehicle detector is that vehicles come in, roughly, 3 different aspect ratios (tall and skinny, e.g. semi trucks; short and wide, e.g. sedans; and squarish, e.g. SUVs). For purposes of display, I have combined the 3 images into one by taking the pointwise max of the 3 original images. You can see this combined image below. The dark blue areas are places the CNN is saying "definitely not a vehicle" and the bright red locations are the positions it thinks contain a vehicle.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtineTs6MQuaZYARaRwP8DETjr5QYvCqao_eq1NvNFMqrxW5FOksu9N2gAIZ7D7a0C0ujz56ymmQ4c-iQpQBLEN9EtEaT9SjDUbrXiNIG3kQ8_CiH9gX0AJiPpoT0rWTG81Sy8yx2bPt4/s1600/mmod_tutorial_output_tensor.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="848" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtineTs6MQuaZYARaRwP8DETjr5QYvCqao_eq1NvNFMqrxW5FOksu9N2gAIZ7D7a0C0ujz56ymmQ4c-iQpQBLEN9EtEaT9SjDUbrXiNIG3kQ8_CiH9gX0AJiPpoT0rWTG81Sy8yx2bPt4/s640/mmod_tutorial_output_tensor.jpg" width="338" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
If we overlay this CNN output on top of the tiled pyramid you can see it's doing the right thing. The cars get bright red dots on them, right in the centers of the cars. Moreover, you can tell that the CNN is only detecting cars at a certain scale. The smaller cars are detected at the top of the pyramid and only as we progress down the pyramid does it begin to detect the larger cars.</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCWVJpceLA78_bCYPGvr_8fn5mCw7R8cbkiMWEJ7j2lITiqaKz8OQyMoZ0hePCUkNroevP6Df8u1dqpbguc2LyQ8ZV3ozzXXRBSZmun30XCLB3og3yB0RvnK2SyQ6aC1W6_CBYA3JlC8Q/s1600/mmod_tutorial_image_pyramid_with_saliency.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="858" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCWVJpceLA78_bCYPGvr_8fn5mCw7R8cbkiMWEJ7j2lITiqaKz8OQyMoZ0hePCUkNroevP6Df8u1dqpbguc2LyQ8ZV3ozzXXRBSZmun30XCLB3og3yB0RvnK2SyQ6aC1W6_CBYA3JlC8Q/s640/mmod_tutorial_image_pyramid_with_saliency.jpg" width="342" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
After the CNN output is obtained, all the detection code needs to do is threshold the CNN output, find all the hot spots, apply non-max suppression, and output the boxes corresponding to the identified hot spots. And that's it, that's all the CNN+MMOD detector is doing.</div>
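<div class="separator" style="clear: both; text-align: left;">
For the curious, here is a hand-written sketch of steps 3 and 4. To be clear, dlib's MMOD loss layer does all of this for you; this fragment is only to make the logic concrete. The names out (the CNN's detection strength map) and to_input_rect() (a helper mapping an output pixel back to a box in the original image) are hypothetical, while box_intersection_over_union() is a real dlib function. The 0.5 overlap cutoff is an arbitrary choice for illustration:<br />
<pre class="code_box">// Step 3: threshold the CNN output to get the preliminary detections.
std::vector<std::pair<double,rectangle>> dets;
for (long r = 0; r < out.nr(); ++r)
    for (long c = 0; c < out.nc(); ++c)
        if (out(r,c) > 0)
            dets.emplace_back(out(r,c), to_input_rect(r,c));  // hypothetical helper

// Step 4: greedy non-max suppression. Keep the strongest box, discard
// anything overlapping it too much, and repeat.
std::sort(dets.begin(), dets.end(),
    [](const auto& a, const auto& b){ return a.first > b.first; });
std::vector<rectangle> final_dets;
for (const auto& d : dets)
{
    bool overlaps = false;
    for (const auto& kept : final_dets)
        if (box_intersection_over_union(d.second, kept) > 0.5)
            overlaps = true;
    if (!overlaps)
        final_dets.push_back(d.second);
}</pre>
</div>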
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
On the other hand, describing how the CNN is trained is more complicated. The code in dlib uses the usual stochastic gradient descent methods, and you can see many of the details if you read the dlib DNN example programs. How deep learning works in general is a big topic, but the most interesting thing here is the MMOD loss layer. For the gory details on that I refer you to the <a href="https://arxiv.org/abs/1502.00046">MMOD paper</a>, which explains the loss function. In the paper, the loss is discussed in the context of networks that are linear in their parameters, rather than non-linear in their parameters as our CNN is here. However, for understanding the loss, the distinction between linear and non-linear is a minor detail. In fact, the loss equations are the same for both cases. The only difference is which optimization algorithms are available for each case. With a linear parameterization you can write a fancy numeric solver capable of solving the problem in a few minutes, but with a non-linear parameterization you have to resort to brute force SGD and GPUs running for many hours. </div>
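<div class="separator" style="clear: both; text-align: left;">
For reference, setting up this kind of training run with dlib looks roughly like the sketch below, condensed from the linked example programs. The 40x40 detection window, the file names, and the tiny stand-in net_type are all assumptions here; the real settings for the vehicle detector are in the training code linked above:<br />
<pre class="code_box">#include <dlib/dnn.h>
#include <dlib/data_io.h>
using namespace dlib;

// Stand-in network. A real detector uses a deeper loss_mmod network, as in
// the example programs.
using net_type = loss_mmod<con<1,9,9,1,1,input_rgb_image_pyramid<pyramid_down<6>>>>;

int main()
{
    std::vector<matrix<rgb_pixel>> images;
    std::vector<std::vector<mmod_rect>> boxes;
    load_image_dataset(images, boxes, "training_data.xml");

    mmod_options options(boxes, 40, 40);  // infer detection window shapes from the data
    net_type net(options);

    dnn_trainer<net_type> trainer(net);
    trainer.set_learning_rate(0.1);
    trainer.set_min_learning_rate(1e-5);
    trainer.train(images, boxes);

    serialize("vehicle_detector.dat") << net;
}</pre>
</div>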
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
But at a very high level, it's running the entire detection process over and over during training, counting the number of detection mistakes (false alarms, missed detections, and duplicate detections), and back-propagating that error gradient through the CNN until the CNN stops messing up. Also, since the MMOD loss layer is counting mistakes after non-max suppression is applied, it knows that it needs to get the CNN to avoid producing high outputs in parts of the image that won't be suppressed by non-max suppression. This is why you see the dark blue areas of "definitely not a car" surrounding each of the car detections. The CNN has learned that it needs to be very careful on the border between "it's a car" and "it's not a car" to avoid accidentally detecting the same car multiple times. </div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
This is perhaps easiest to see if we merge the pyramid layers back into the original image. If we make an image where the pixel value is the max over all scales in the pyramid we get this image:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCVr1f7_p3oX_2NyWBi9d2V9sfbSErVkuOo32b3-hw8Z7zc6h_8SMoRbFFGcTUlF14d4W7y5rlYz8U1hm0s95rOV2NTNEvns_EZU11FWXKjgq3CSYAkzDal6fTK1eEavTwMTgLIwiSTO8/s1600/mmod_tutorial_saliency.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="478" data-original-width="928" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCVr1f7_p3oX_2NyWBi9d2V9sfbSErVkuOo32b3-hw8Z7zc6h_8SMoRbFFGcTUlF14d4W7y5rlYz8U1hm0s95rOV2NTNEvns_EZU11FWXKjgq3CSYAkzDal6fTK1eEavTwMTgLIwiSTO8/s640/mmod_tutorial_saliency.jpg" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Here you can clearly see the 6 car hotspots and the dark blue areas of "not a car" immediately surrounding them. Finally, overlaying this on the original image gives this wonderful image:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOPmEK_p3rGZQjNNROFtrGlArNzTsHRchyphenhyphenUu9zeweNEgQaRjB4oxNVtnlOotUvIDfVxOgkiChyphenhyphengBcqhfmTqKKf_oFwkRkEP3_XZA1sJ-hQoAf9uwNBHu7rP2iU61DUMZA5xFWuxcA-XRo/s1600/mmod_tutorial_raw_image_with_saliency.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="478" data-original-width="928" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOPmEK_p3rGZQjNNROFtrGlArNzTsHRchyphenhyphenUu9zeweNEgQaRjB4oxNVtnlOotUvIDfVxOgkiChyphenhyphengBcqhfmTqKKf_oFwkRkEP3_XZA1sJ-hQoAf9uwNBHu7rP2iU61DUMZA5xFWuxcA-XRo/s640/mmod_tutorial_raw_image_with_saliency.jpg" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<br />
<br /></div>
Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com50tag:blogger.com,1999:blog-6061887630060661987.post-7544539590650980172017-02-12T13:18:00.000-05:002017-03-30T07:39:58.176-04:00High Quality Face Recognition with Deep Metric Learning<div class="separator" style="clear: both; text-align: left;">
Since the last dlib release, I've been working on adding easy to use <a href="https://github.com/davisking/dlib/blob/master/examples/dnn_metric_learning_ex.cpp">deep metric learning tooling</a> to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add <a href="https://github.com/davisking/dlib/blob/master/examples/dnn_face_recognition_ex.cpp">a face recognition example program</a> to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG7UWXvFGBCSMbzbYCZ_57qkiKY31RHprHpFu3ImTX6PPgAqXMiT-JcTidodXR_8Sqm_hSauGTyQFxgS2mSzUOqrhiKXpv2uC0wYX-aaVq7Ic3B-jFfUkFZyRNYL8q7xnQYTyV1F4BKAk/s1600/deep_learning_magic3.svg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG7UWXvFGBCSMbzbYCZ_57qkiKY31RHprHpFu3ImTX6PPgAqXMiT-JcTidodXR_8Sqm_hSauGTyQFxgS2mSzUOqrhiKXpv2uC0wYX-aaVq7Ic3B-jFfUkFZyRNYL8q7xnQYTyV1F4BKAk/s1600/deep_learning_magic3.svg.png" /></a></div>
<br />
<br />
Just like all the <a href="https://github.com/davisking/dlib-models">other example dlib models</a>, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard <a href="http://vis-www.cs.umass.edu/lfw/">Labeled Faces in the Wild</a> benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.<br />
<br />
For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.<br />
<br />
The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of sources: the face scrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.<br />
<br />
The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading <a href="https://github.com/davisking/dlib/blob/master/examples/dnn_face_recognition_ex.cpp">the example program</a> and consulting the referenced parts of dlib. There is also a <a href="https://github.com/davisking/dlib/blob/master/python_examples/face_recognition.py">Python API</a> for accessing the face recognition model.<br />
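<br />
In code, using the model for verification comes down to a few lines. This is a condensed sketch of what the example program does; anet_type is the ResNet definition from dnn_face_recognition_ex.cpp (assumed pasted in above this fragment), and face_chip1/face_chip2 are aligned 150x150 face crops produced as in that example:<br />
<pre class="code_box">anet_type net;
deserialize("dlib_face_recognition_resnet_model_v1.dat") >> net;

// Map each aligned face chip to a 128D descriptor.
matrix<float,0,1> d1 = net(face_chip1);
matrix<float,0,1> d2 = net(face_chip2);

// Identities were trained into non-overlapping balls of radius 0.6, so a
// simple distance threshold decides "same person or not".
if (length(d1 - d2) < 0.6)
    std::cout << "same person" << std::endl;</pre>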
<br />
<br />
<br />
[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.<br />
[2] H.-W. Ng, S. Winkler. A data-driven approach to cleaning large face datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com466tag:blogger.com,1999:blog-6061887630060661987.post-77139259744195523822016-10-11T06:59:00.000-04:002016-10-11T06:59:10.365-04:00Easily Create High Quality Object Detectors with Deep LearningA <a href="http://blog.dlib.net/2014/02/dlib-186-released-make-your-own-object.html">few years ago</a> I added an implementation of the <a href="https://arxiv.org/abs/1502.00046">max-margin object-detection algorithm</a> (MMOD) to dlib. This tool has since become quite popular as it frees the user from tedious tasks like hard negative mining. You simply label things in images and it learns to detect them. It also produces high quality detectors from relatively small amounts of training data. For instance, <a href="http://dlib.net/fhog_object_detector_ex.cpp.html">one of dlib's example programs</a> shows MMOD learning a serviceable face detector from only 4 images. <br />
<br />
However, the MMOD implementation in dlib used HOG feature extraction followed by a single linear filter. This means it's incapable of learning to detect objects that exhibit complex pose variation or have a lot of other variability in how they appear. To get around this, users typically train multiple detectors, one for each pose. That works OK in many cases but isn't a really good general solution. Fortunately, over the last few years convolutional neural networks have proven themselves to be capable of dealing with all these issues within a single model. <br />
<br />
So the obvious thing to do was to add an implementation of MMOD with the HOG feature extraction replaced with a convolutional neural network. <a href="http://dlib.net/release_notes.html">The new version of dlib</a>, v19.2, contains just such a thing. <a href="http://dlib.net/dnn_mmod_ex.cpp.html">On this page</a> you can see a short tutorial showing how to train a convolutional neural network using the MMOD loss function. It uses <a href="http://blog.dlib.net/2016/06/a-clean-c11-deep-learning-api.html">dlib's new deep learning API</a> to train the detector end-to-end on the very same 4 image dataset used in the HOG version of the example program. Happily, and very much to the surprise of myself and my colleagues, it learns a working face detector from this tiny dataset. Here is the detector run over an image not in the training data:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMqBOqDZvp8WoU-ueytgjmdIIwxuvUpRjRzfBN9YOHFy0QZb77OEohKNTSUtowaGVckBY6PvS73Bo6NP4Fx0H2Ybzqq-wqMsbxn9o9EvU-3nSwFDW9bniK3m2gcSTnfhnPPHXJjQdvymg/s1600/mmod_example_test_image.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="419" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMqBOqDZvp8WoU-ueytgjmdIIwxuvUpRjRzfBN9YOHFy0QZb77OEohKNTSUtowaGVckBY6PvS73Bo6NP4Fx0H2Ybzqq-wqMsbxn9o9EvU-3nSwFDW9bniK3m2gcSTnfhnPPHXJjQdvymg/s640/mmod_example_test_image.jpg" width="640" /></a></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVGAAtHvLBnAllH6evfA_iQoB3oovyzNUGbfG9bWM-0iAdOPh-k984R9eEd6TeFC3edAxNEavZ-vLCujn05PrN4-0pu8gY0EJi1mmUFYBYdSCzGEdi6aIMbLlZGBcFLw310mYjyva529w/s1600/MMOD_vs_fasterRCNN_on_FDDB_smaller_image.png" imageanchor="1"></a><br />
I expected the CNN version of MMOD to inherit the low training data requirements of the HOG version of MMOD, but working with only 4 training images is very surprising considering other deep learning methods typically require many thousands of images to produce any kind of sensible results.<br />
<br />
The detector is also reasonably fast for a CNN. On the CPU, it takes about 370ms to process a 640x480 image. On my NVIDIA Titan X GPU (the Maxwell version, not the newer Pascal version) it takes 45ms to process an image when images are processed one at a time. If I group the images into batches then it takes about 18ms per image. <br />
<br />
To really test the new CNN version of MMOD, I ran it through the leading face detection benchmark, <a href="http://vis-www.cs.umass.edu/fddb/">FDDB</a>. This benchmark has two modes, 10-fold cross-validation and unrestricted. Both test on the same dataset, but in the 10-fold cross-validation mode you are only allowed to train on data in the FDDB dataset. In the unrestricted mode you can train on any data you like so long as it doesn't include images from FDDB. I ran the 10-fold cross-validation version of the FDDB challenge. This means I trained 10 CNN face detectors, each on 9 folds and tested on the held out 10th. I did not perform any hyperparameter tuning. Then I ran the results through the FDDB evaluation software and got this plot:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYF8xY_9h5Yw06EAeMEcJNons7trWSl60hzSVaiKwxylWxTe1fMAJZY7l00bqcgpHq9Dmc11U2nWcDxXYSV5RGcXg23e3I2pqSqPCD8o3u7sbLil-Lnl4mlvlvVoftWzd6x9XHgP-vVdY/s1600/MMOD_vs_fasterRCNN_vs_ViolaJones_on_FDDB.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYF8xY_9h5Yw06EAeMEcJNons7trWSl60hzSVaiKwxylWxTe1fMAJZY7l00bqcgpHq9Dmc11U2nWcDxXYSV5RGcXg23e3I2pqSqPCD8o3u7sbLil-Lnl4mlvlvVoftWzd6x9XHgP-vVdY/s640/MMOD_vs_fasterRCNN_vs_ViolaJones_on_FDDB.png" width="640" /></a></div>
<br />
The X axis is the number of false alarms produced over the entire 2845 image dataset. The Y axis is recall, i.e. the fraction of faces found by the detector. The green curve is the new dlib detector, which in this mode only gets about 4600 faces to train on. The red curve is the old Viola Jones detector which is still popular (although it shouldn't be, obviously). Most interestingly, the blue curve is a state-of-the-art result from the paper <a href="https://arxiv.org/abs/1606.03473">Face Detection with the Faster R-CNN</a>, published only 4 months ago. In that paper, they train their detector on the very large WIDER dataset, which consists of 159,424 faces, and arguably get worse results on FDDB than the dlib detector trained on only 4600 faces.<br />
<br />
As another test, I created the <a href="http://blog.dlib.net/2016/10/hipsterize-your-dog-with-deep-learning.html">dog hipsterizer</a>, which I made a post about a few days ago. The hipsterizer used the exact same code and parameter settings to train a dog head detector. The only difference was that the training data consisted of 9240 dog heads instead of human faces. That produced the very high quality models used in the hipsterizer. So now we can automatically create fantastic images such as this one :)<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5JUxhTrbm-p_7j5n6lsXH3I_m7oxVtBmdF5aGemaRN6ihm3wYS6ArKM9GGM6Ri6SwEFCk4y0gd5qy6hZi-IlqwZUO_8sRsrl8LJG9QW4llUp7lnPyLCQ089uqbyhHbtL9uLQSj4ylIts/s1600/barkhaus_hip.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5JUxhTrbm-p_7j5n6lsXH3I_m7oxVtBmdF5aGemaRN6ihm3wYS6ArKM9GGM6Ri6SwEFCk4y0gd5qy6hZi-IlqwZUO_8sRsrl8LJG9QW4llUp7lnPyLCQ089uqbyhHbtL9uLQSj4ylIts/s640/barkhaus_hip.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.thebarkhaus.com/">Barkhaus dogs</a> looking fancy</td></tr>
</tbody></table>
<br />
As one last test of the new CNN MMOD tool I made a <a href="http://dlib.net/files/data/dlib_face_detection_dataset-2016-09-30.tar.gz">dataset of 6975 faces</a>. This dataset is a collection of face images selected from many publicly available datasets (excluding the FDDB dataset). In particular, there are images from ImageNet, AFLW, Pascal VOC, the VGG dataset, WIDER, and face scrub. Unlike FDDB, this new dataset contains faces in a wide range of poses rather than consisting of mostly front facing shots. To give you an idea of what it looks like, here are all the faces in the dataset tightly cropped and tiled into one big image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaOR45C5hNNjipYxgUVhNbmtm4EMiZxnoDnnKbCm4kTVVKVkL8iuxIKtTZKiZMa0SfPn20igqnXAGqUrYBkcWln_uM81R4fNQaqVlnQbUoM6H1WvftQdAiI_WSoVRLEvvoDj96QQtXFv4/s1600/dlib_faces.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaOR45C5hNNjipYxgUVhNbmtm4EMiZxnoDnnKbCm4kTVVKVkL8iuxIKtTZKiZMa0SfPn20igqnXAGqUrYBkcWln_uM81R4fNQaqVlnQbUoM6H1WvftQdAiI_WSoVRLEvvoDj96QQtXFv4/s640/dlib_faces.jpg" width="632" /></a></div>
<br />
Using the new dlib tooling I trained a CNN on this dataset using the same exact code and parameter settings as used by the dog hipsterizer and previous FDDB experiment. If you want to run that CNN on your own images you can use <a href="http://dlib.net/dnn_mmod_face_detection_ex.cpp.html">this example program</a>. I tested this CNN on FDDB's unrestricted protocol and found that it has a recall of 0.879134, which is quite good. However, it produced 90 false alarms, which sounds bad until you look at them and find that it's actually finding labeling errors in FDDB. The following image shows all the "false alarms" it outputs on FDDB. All but one of them are actually faces.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvbwXjUH8_Q_TUAExUrNhGU0KtdgZSfIq2TEelUaLQMnRJFWFhbAGoOgU-GJTHAwJhmYqs3IIvTAYgpIrlF4h9kVmvM3ij7J-gmcagD4P1WFtNuRjbR3cqNWgvqA6ohDB1XBWjWM2xaws/s1600/mmod_cnn_false_alarms_on_FDDB.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="568" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvbwXjUH8_Q_TUAExUrNhGU0KtdgZSfIq2TEelUaLQMnRJFWFhbAGoOgU-GJTHAwJhmYqs3IIvTAYgpIrlF4h9kVmvM3ij7J-gmcagD4P1WFtNuRjbR3cqNWgvqA6ohDB1XBWjWM2xaws/s640/mmod_cnn_false_alarms_on_FDDB.jpg" width="640" /></a></div>
<br />
Finally, to give you a more visceral idea of the difference in capability between the new CNN detector and the old HOG detector, here are a few images where I ran dlib's default HOG face detector (which is actually 5 HOG models) and the <a href="http://dlib.net/dnn_mmod_face_detection_ex.cpp.html">new CNN face detector</a>. The red boxes are CNN detections and blue boxes are from the older HOG detector. While the HOG detector does an excellent job on easy faces looking at the camera, you can see that the CNN is way better at handling not just the easy cases but all faces in general. And yes, I ran the HOG detector on all the images; it's just that it fails to find any faces in some of them.<br />
<div>
<br /></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTWKufyktuc6D80ts5kaFVycMxSrPfkymFnpYe3L_jXXrCZKHhxE3jRelMs-EtbHvR7TBdmuP9Lcu5A4pd44u1pSHcXN0wjpsoykteX7YfxwZyBw_EXgpICOSU_wf6uIkNmlpV4MiitUg/s1600/Screenshot+from+2016-10-02+09%253A43%253A22.png" imageanchor="1" style="clear: right; display: inline !important; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="249" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTWKufyktuc6D80ts5kaFVycMxSrPfkymFnpYe3L_jXXrCZKHhxE3jRelMs-EtbHvR7TBdmuP9Lcu5A4pd44u1pSHcXN0wjpsoykteX7YfxwZyBw_EXgpICOSU_wf6uIkNmlpV4MiitUg/s320/Screenshot+from+2016-10-02+09%253A43%253A22.png" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWnxyKE66-mZ3v3tieRJuRXvYyQHhANhfwulrARie2GNrSo5OCNr6cSldCEAkafmGPWFuMJEq6eZIDPWYRkwfsgnOl05GhYV9oTU4YwyEIlkE047P2K7jwu8qo5wzD3CNQbSS0_AuV2jk/s1600/Screenshot+from+2016-10-02+09%253A43%253A16.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWnxyKE66-mZ3v3tieRJuRXvYyQHhANhfwulrARie2GNrSo5OCNr6cSldCEAkafmGPWFuMJEq6eZIDPWYRkwfsgnOl05GhYV9oTU4YwyEIlkE047P2K7jwu8qo5wzD3CNQbSS0_AuV2jk/s320/Screenshot+from+2016-10-02+09%253A43%253A16.png" width="320" /></a><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjJOa55DSLDtUOpHmT4TqzBL_6yoPKW7IyvshxVUlyyapxZEVNyyVBKK06KgLhKCjrn0lZ5aCG1FdXJcQEd1NR7-8oczCwqDeBCNm0G61Sgsnj0dkWIknqV4v1mYyJsbalXRMokmsxNyI/s1600/Screenshot+from+2016-10-02+09%253A43%253A28.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjJOa55DSLDtUOpHmT4TqzBL_6yoPKW7IyvshxVUlyyapxZEVNyyVBKK06KgLhKCjrn0lZ5aCG1FdXJcQEd1NR7-8oczCwqDeBCNm0G61Sgsnj0dkWIknqV4v1mYyJsbalXRMokmsxNyI/s320/Screenshot+from+2016-10-02+09%253A43%253A28.png" width="320" /></a><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_NzdmtUCndnQ4c8j9iiCiBMktdqsE1YanC1QG5OnC2iDNq3NFNI2FfoM3DvUsiwzK7nW6Qi1GDDlDr6fMuUccsunGm1n-Cb7hgQOHtTkrNrqeun1OCx6vg2jYxrWu5u7QlmCTRtVok6g/s1600/Screenshot+from+2016-10-02+09%253A43%253A33.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_NzdmtUCndnQ4c8j9iiCiBMktdqsE1YanC1QG5OnC2iDNq3NFNI2FfoM3DvUsiwzK7nW6Qi1GDDlDr6fMuUccsunGm1n-Cb7hgQOHtTkrNrqeun1OCx6vg2jYxrWu5u7QlmCTRtVok6g/s320/Screenshot+from+2016-10-02+09%253A43%253A33.png" width="320" /></a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc3ZBAm3M3t4vc_z8xXedCaqTBwxO_acrE4FiAhMqTCnltIRVfRxzyLvBOPnrPHzN0eN0TLAXEMyyXyTYUwKcMdJWuTv9nW9IOYckgZ-wym24IrWt9qyMgolZiytx3OipexO_QKBaZJOE/s1600/Screenshot+from+2016-10-10+17%253A55%253A51.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc3ZBAm3M3t4vc_z8xXedCaqTBwxO_acrE4FiAhMqTCnltIRVfRxzyLvBOPnrPHzN0eN0TLAXEMyyXyTYUwKcMdJWuTv9nW9IOYckgZ-wym24IrWt9qyMgolZiytx3OipexO_QKBaZJOE/s320/Screenshot+from+2016-10-10+17%253A55%253A51.png" width="320" /></a></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifTSgKEpdf5CtE10Q9pp0bYRQiXKAcQZHq7Q6RsQbOInzAkEl6eqY8svH3yZaWaT_zGjSnIG8puXTvJv4C_xv0r3cBDlA50fFhMFtSvG0hWLfTuL9IlVGNq9PmLey8nnBMl8cDVmPEO8Q/s1600/Screenshot+from+2016-10-02+09%253A43%253A40.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifTSgKEpdf5CtE10Q9pp0bYRQiXKAcQZHq7Q6RsQbOInzAkEl6eqY8svH3yZaWaT_zGjSnIG8puXTvJv4C_xv0r3cBDlA50fFhMFtSvG0hWLfTuL9IlVGNq9PmLey8nnBMl8cDVmPEO8Q/s320/Screenshot+from+2016-10-02+09%253A43%253A40.png" width="320" /></a>Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com254tag:blogger.com,1999:blog-6061887630060661987.post-85092188419489995272016-10-07T21:24:00.000-04:002020-04-17T06:04:15.610-04:00Hipsterize Your Dog With Deep LearningI'm getting ready to make the next <a href="http://dlib.net/">dlib</a> release, which should be out in a few days, and I thought I would point out a humorous new example program. The <a href="https://github.com/davisking/dlib/blob/master/examples/dnn_mmod_dog_hipsterizer.cpp">dog hipsterizer</a>!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmP2GfHn6BUqLE0k3BFGWPiwzKUxrIXXnXUFkGNtISh5dzBfgEFaDPSDFw8doqks308_koNWi0PGGnrAuXu0vu-WZ31qPSvxTlok9SqbJ_C514MwBQ5S1Q2esLlF9EbyDYhxOhGwv6dLs/s1600/dog_hipsterizer.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmP2GfHn6BUqLE0k3BFGWPiwzKUxrIXXnXUFkGNtISh5dzBfgEFaDPSDFw8doqks308_koNWi0PGGnrAuXu0vu-WZ31qPSvxTlok9SqbJ_C514MwBQ5S1Q2esLlF9EbyDYhxOhGwv6dLs/s640/dog_hipsterizer.png" width="640" /></a></div>
<br />
It uses dlib's new <a href="http://blog.dlib.net/2016/06/a-clean-c11-deep-learning-api.html">deep learning tools</a> to detect dogs looking at the camera. Then it uses the <a href="http://blog.dlib.net/2014/08/real-time-face-pose-estimation.html">dlib shape predictor</a> to identify the positions of the eyes, nose, and top of the head. From there it's trivial to make your dog hip with glasses and a mustache :)<br />
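<br />
The whole pipeline boils down to just a few calls. Here is a rough condensation of the example program; net_type is the hipsterizer's loss_mmod network as defined there, and I've elided the code that actually draws the glasses and mustache:<br />
<pre class="code_box">net_type net;        // the CNN dog head detector (a loss_mmod network)
shape_predictor sp;  // finds the eyes, nose, and top of the head
deserialize("mmod_dog_hipsterizer.dat") >> net >> sp;

matrix<rgb_pixel> img;
load_image(img, "dogs.jpg");

for (auto&& d : net(img))
{
    full_object_detection shape = sp(img, d.rect);
    // shape.part(i) now gives the landmark positions, which is all you need
    // to scale and rotate the glasses and mustache into place.
}</pre>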
<br />
This is what you get when you run the <a href="https://github.com/davisking/dlib/blob/master/examples/dnn_mmod_dog_hipsterizer.cpp">dog hipsterizer</a> on this awesome image:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOG-SGbfMy4OuVOeaDGuU-zWoV-S6ZiZ_R4_9ZjFEtrZi_QR1xrV2sF6IOi1KPZk6HFn6_keCiO9wTVCY1NTlqSk1Vugho4XeLAoXZ3jckQSsCJV8h7i__rnx77kErPXschyphenhyphenoh-Fbd1Io/s1600/barkhaus_hip.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOG-SGbfMy4OuVOeaDGuU-zWoV-S6ZiZ_R4_9ZjFEtrZi_QR1xrV2sF6IOi1KPZk6HFn6_keCiO9wTVCY1NTlqSk1Vugho4XeLAoXZ3jckQSsCJV8h7i__rnx77kErPXschyphenhyphenoh-Fbd1Io/s640/barkhaus_hip.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.thebarkhaus.com/">Barkhaus dogs</a> looking fancy</td></tr>
</tbody></table>
<br />Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com31tag:blogger.com,1999:blog-6061887630060661987.post-92223819869274347242016-08-13T14:48:00.000-04:002016-08-13T14:52:50.408-04:00Dlib 19.1 ReleasedcuDNN 5.1 is out and it isn't completely backwards compatible with cuDNN 5.0 due to a bug in cuDNN 5.1. For the curious, in cuDNN 5.1 cudnnGetConvolutionBackwardFilterAlgorithm() will select the winograd algorithm even when the conv descriptor has a stride not equal to 1, which is an error according to the cuDNN documentation. If you then try to run the winograd algorithm, which is what cudnnGetConvolutionBackwardFilterAlgorithm() says to do, it leads to the wrong outputs and things don't work. Fortunately, this was detected by dlib's unit tests :)<br />
<br />
Therefore, dlib has been updated to work with cuDNN 5.1 and hence we have a dlib 19.1 release, which you can download from <a href="http://dlib.net/">dlib's home page</a>.<br />
<br />
I also recently realized that the fancy std::async() in C++11, an API for launching asynchronous tasks, is not backed by any kind of load balancing at all. For example, if you call std::async() at a faster rate than the tasks complete then your program will create an unbounded number of threads, leading to an eventual crash. That's awful. But std::async() is a nice API and I want to use it. So dlib now contains <a href="http://dlib.net/dlib/threads/async_abstract.h.html">dlib::async()</a> which has the same interface, except instead of the <a href="http://eli.thegreenplace.net/2016/the-promises-and-challenges-of-stdasync-task-based-parallelism-in-c11/">half baked</a> launch policy as the first argument, dlib::async() takes a dlib::thread_pool, giving dlib::async() all the bounded resource use properties of dlib::thread_pool. Moreover, if you don't give dlib::async() a thread pool it will default to a global thread pool instance that contains std::thread::hardware_concurrency() threads. Yay.Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com24tag:blogger.com,1999:blog-6061887630060661987.post-69010443876411396762016-06-26T08:29:00.002-04:002016-06-27T06:48:19.739-04:00A Clean C++11 Deep Learning API<a href="http://dlib.net/">Dlib</a> 19.0 is out and it has a lot of new features, like new elastic net and quadratic program solvers. But the feature I'm most excited about is the new deep learning API. There are a lot of existing deep learning frameworks, but none of them have clean C++ APIs. You have to use them through a language like Python or Lua, which is fine in and of itself. But if you are a professional software engineer working on embedded computer vision projects you are probably working in C++, and using those tools in these kinds of applications can be frustrating. <br />
<br />
So if you use C++ to do computer vision work then dlib's deep learning framework is for you. It makes heavy use of C++11 features, allowing it to expose a very clean and lightweight API. For example, the venerable LeNet can be defined in pure C++ with a using statement:<br />
<div>
<div>
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuXA7qRCTB8D_RFO3ISRFI3rv_FTgFpoJtdmL1cfsQ1kuuBx8cG_gTvsorcSbyKVd-7vVE0Hew5T02lpcXRWFamILH5Kjz7afNOmL9szas2W6waCJQute7oS_P0LyBYr92RRrNaeUnPps/s1600/lenet5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuXA7qRCTB8D_RFO3ISRFI3rv_FTgFpoJtdmL1cfsQ1kuuBx8cG_gTvsorcSbyKVd-7vVE0Hew5T02lpcXRWFamILH5Kjz7afNOmL9szas2W6waCJQute7oS_P0LyBYr92RRrNaeUnPps/s1600/lenet5.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">LeNet</span></td></tr>
</tbody></table>
<div>
<br class="Apple-interchange-newline" /></div>
<div>
<pre> <span style="color: blue;">using</span> LeNet <span style="color: #5555ff;">=</span> loss_multiclass_log<span style="color: #5555ff;"><</span>
fc<span style="color: #5555ff;"><</span><span style="color: #979000;">10</span>,
relu<span style="color: #5555ff;"><</span>fc<span style="color: #5555ff;"><</span><span style="color: #979000;">84</span>,
relu<span style="color: #5555ff;"><</span>fc<span style="color: #5555ff;"><</span><span style="color: #979000;">120</span>,
max_pool<span style="color: #5555ff;"><</span><span style="color: #979000;">2</span>,<span style="color: #979000;">2</span>,<span style="color: #979000;">2</span>,<span style="color: #979000;">2</span>,relu<span style="color: #5555ff;"><</span>con<span style="color: #5555ff;"><</span><span style="color: #979000;">16</span>,<span style="color: #979000;">5</span>,<span style="color: #979000;">5</span>,<span style="color: #979000;">1</span>,<span style="color: #979000;">1</span>,
max_pool<span style="color: #5555ff;"><</span><span style="color: #979000;">2</span>,<span style="color: #979000;">2</span>,<span style="color: #979000;">2</span>,<span style="color: #979000;">2</span>,relu<span style="color: #5555ff;"><</span>con<span style="color: #5555ff;"><</span><span style="color: #979000;">6</span>,<span style="color: #979000;">5</span>,<span style="color: #979000;">5</span>,<span style="color: #979000;">1</span>,<span style="color: #979000;">1</span>,
input<span style="color: #5555ff;"><</span>matrix<span style="color: #5555ff;"><</span><span style="color: blue;"><u>unsigned</u></span> <span style="color: blue;"><u>char</u></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span><span style="color: #5555ff;">></span>;</pre>
</div>
<br />
Then, using it to train and test a neural network looks like this:<br />
<pre></pre>
<pre> LeNet net;
dnn_trainer<span style="color: #5555ff;"><</span>LeNet<span style="color: #5555ff;">></span> <span style="color: #bb00bb;">trainer</span><span style="font-family: "lucida console";">(</span>net<span style="font-family: "lucida console";">)</span>;
trainer.<span style="color: #bb00bb;">set_learning_rate</span><span style="font-family: "lucida console";">(</span><span style="color: #979000;">0.01</span><span style="font-family: "lucida console";">)</span>;
trainer.<span style="color: #bb00bb;">set_min_learning_rate</span><span style="font-family: "lucida console";">(</span><span style="color: #979000;">0.00001</span><span style="font-family: "lucida console";">)</span>;
trainer.<span style="color: #bb00bb;">set_mini_batch_size</span><span style="font-family: "lucida console";">(</span><span style="color: #979000;">128</span><span style="font-family: "lucida console";">)</span>;
trainer.<span style="color: #bb00bb;">train</span><span style="font-family: "lucida console";">(</span>training_images, training_labels<span style="font-family: "lucida console";">)</span>;
<span style="color: #009900;">// Ask the net to predict labels for all the testing images</span>
<span style="color: blue;">auto</span> predicted_labels <span style="color: #5555ff;">=</span> <span style="color: #bb00bb;">net</span><span style="font-family: "lucida console";">(</span>testing_images<span style="font-family: "lucida console";">)</span>;</pre>
<br />
Dlib will even automatically switch to lower learning rates when the training error stops improving, so you won't have to fiddle with learning rate schedules. The API will certainly let you do so if you want that control. But I've been able to train a number of state-of-the-art ImageNet models without any manual fiddling of learning rates, which I find to be very convenient.<br />
<br />
Depending on how you compile dlib, it will use either the CPU or <a href="https://developer.nvidia.com/cudnn">cuDNN</a> v5. It also supports using multiple GPUs during training and has a "fast mode" and a "low VRAM" mode. Compared to <a href="http://caffe.berkeleyvision.org/">Caffe</a>, dlib's fast mode is about 1.6x faster than Caffe but uses about 1.5x as much VRAM, while the low VRAM mode is about 0.85x the speed of Caffe but uses half as much VRAM as Caffe. So dlib's new deep learning API is fast but can also let you run larger models in the same amount of VRAM if you are VRAM constrained.<br />
<br />
It's also fully documented. The basics are covered in <a href="http://dlib.net/dnn_introduction_ex.cpp.html">this tutorial</a> and then more advanced concepts are covered in a <a href="http://dlib.net/dnn_introduction2_ex.cpp.html">follow on tutorial</a>. These tutorials show how to define LeNet and ResNet architectures in dlib and another tutorial shows <a href="http://dlib.net/dnn_inception_ex.cpp.html">how to define Inception networks</a>. And even more importantly, every function and class in the API is documented in the reference material. Moreover, if you want to <a href="http://dlib.net/dlib/dnn/layers_abstract.h.html#EXAMPLE_COMPUTATIONAL_LAYER_">define your own computational layers</a>, <a href="http://dlib.net/dlib/dnn/loss_abstract.h.html#EXAMPLE_LOSS_LAYER_">loss layers</a>, <a href="http://dlib.net/dlib/dnn/input_abstract.h.html#EXAMPLE_INPUT_LAYER">input layers</a>, or <a href="http://dlib.net/dlib/dnn/solvers_abstract.h.html#EXAMPLE_SOLVER">solvers</a>, you can because the interfaces you have to implement are fully documented.<br />
<br />
I've also included a pretrained ResNet34A model and <a href="http://dlib.net/dnn_imagenet_ex.cpp.html">this example</a> shows how to use it to classify images. This pretrained model has a top-5 error of 7.572% on the 2012 imagenet validation dataset, which is slightly better than the results reported in the original paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun. Training this model took about two weeks while running on a single Titan X GPU. <br />
<br />
To use the new deep learning tools, all you need to install is cuDNN v5. Then you can compile the dlib example programs using the <a href="http://dlib.net/compile.html">normal CMake commands</a>. There are no other dependencies. In fact, if you don't install cuDNN CMake will automatically configure dlib to use only the CPU and the examples will still run (but much slower). You will however need a C++11 compiler, which precludes current versions of Visual Studio since they shamefully <i>still</i> lack full C++11 support. But any mildly recent version of GCC will work. Also, you can use Visual Studio with the non-DNN parts of dlib as they don't require C++11 support.<br />
<br /></div>
<div>
Finally, development of this new deep learning toolkit was sponsored by <a href="http://www.stresearch.com/">Systems & Technology Research</a>, as part of the <a href="https://www.iarpa.gov/index.php/research-programs/janus">IARPA JANUS project</a>. Without their support and feedback it wouldn't be nearly as polished and flexible. <a href="http://www.jeffreybyrne.com/">Jeffrey Byrne</a> in particular was instrumental in finding bugs and usability problems in early versions of the API.<br />
<br />
<br /></div>
<div>
<br />
<br />
<br /></div>
</div>
Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com26tag:blogger.com,1999:blog-6061887630060661987.post-27297852719070711272015-06-05T18:45:00.001-04:002015-06-06T09:30:03.074-04:00Reinforcement Learning, Control, and 3D VisualizationOver the last few months I've spent a lot of time studying optimal control and reinforcement learning. Aside from reading, one of the best ways to learn about something is to do it yourself, which in this case means a lot of playing around with the well known algorithms and, for the ones I really like, adding them to <a href="http://dlib.net/">dlib</a>, which is the subject of this post. So far I've added two methods. The first, added in a previous dlib release, was the well known <a href="http://dlib.net/ml.html#lspi">least squares policy iteration</a> reinforcement learning algorithm. The second, and my favorite so far due to its practicality, is a tool for solving <a href="http://en.wikipedia.org/wiki/Model_predictive_control">model predictive control</a> problems.<br />
<br />
There is a dlib <a href="http://dlib.net/mpc_ex.cpp.html">example program</a> that explains the new model predictive control tool in detail. But the basic idea is that it takes as input a simple linear equation defining how some process evolves in time and then tells you what control input you should apply to make the process go into some user specified state. For example, imagine you have an air vehicle with a rocket on it and you want it to hover at some specific location in the air. You could use a model predictive controller to find out what direction to fire the rocket at each moment to get the desired outcome. In fact, the dlib example program is just that. It produces the following visualization where the vehicle is the black dot and you want it to hover at the green location. The rocket thrust is shown as the red line:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5KpCkAYQ6nAZynRhyxy5Vf-vrSu-VK36WXktlEErStRChkrhV3xrMjEITuCGWHlf5DEhEHjRakcAty-KsUccKYjsOOHpiGnkVS0vHMm5OdwXDZtUVvXnUeya8F0wN4hdkd_KhUsQkF2g/s1600/mpc.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5KpCkAYQ6nAZynRhyxy5Vf-vrSu-VK36WXktlEErStRChkrhV3xrMjEITuCGWHlf5DEhEHjRakcAty-KsUccKYjsOOHpiGnkVS0vHMm5OdwXDZtUVvXnUeya8F0wN4hdkd_KhUsQkF2g/s1600/mpc.gif" /></a></div>
<br />
<br />
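In code, using the new MPC tool looks roughly like the sketch below. It's condensed from the example program, but the dynamics here are a made-up 2D point mass rather than the example's rocket simulation, so treat the specific matrices and numbers as placeholder assumptions:<br />
<pre class="code_box">#include <dlib/control.h>
using namespace dlib;

int main()
{
    const int STATES = 4, CONTROLS = 2;  // 2D position+velocity, 2D thrust
    const double dt = 0.05;

    // The process model: x_{t+1} = A*x_t + B*u_t + C
    matrix<double,STATES,STATES> A = identity_matrix<double>(STATES);
    A(0,2) = dt; A(1,3) = dt;                       // position += velocity*dt
    matrix<double,STATES,CONTROLS> B = zeros_matrix<double>(STATES,CONTROLS);
    B(2,0) = dt; B(3,1) = dt;                       // velocity += thrust*dt
    matrix<double,STATES,1> C = zeros_matrix<double>(STATES,1);  // no constant drift

    matrix<double,STATES,1> Q;
    Q = 1, 1, 0, 0;          // running cost: penalize position error only
    matrix<double,CONTROLS,1> R, lower, upper;
    R = 0.1, 0.1;            // small penalty on control effort
    lower = -100, -100;      // thrust limits
    upper =  100,  100;

    // A controller that plans 30 time steps ahead.
    mpc<STATES,CONTROLS,30> controller(A,B,C,Q,R,lower,upper);

    matrix<double,STATES,1> target, current_state;
    target = 2, 3, 0, 0;     // hover at (2,3) with zero velocity
    current_state = 0, 0, 0, 0;
    controller.set_target(target);

    // Each call solves the optimization and returns the control to apply right now.
    matrix<double,CONTROLS,1> action = controller(current_state);
}</pre>
<br />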
Another fun new tool in dlib is the <a href="http://dlib.net/3d_point_cloud_ex.cpp.html">perspective_window</a>. It's a super easy to use tool for visualizing 3D point cloud data. For instance, the included example program shows how to make this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt4u5_uUcjqMb3cMyNmN7jlE6KNVtn79d01OAbg-W8nqTuZBp01rL7IZnr7WTCrYSfFgfDK08BLSqi8bFt_BEclRoR45aA-0WHOOHJOlt_bQwP-QEsr38brqX8acKWYuRuYAzcpKF7WnE/s1600/perspective_window.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="267" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt4u5_uUcjqMb3cMyNmN7jlE6KNVtn79d01OAbg-W8nqTuZBp01rL7IZnr7WTCrYSfFgfDK08BLSqi8bFt_BEclRoR45aA-0WHOOHJOlt_bQwP-QEsr38brqX8acKWYuRuYAzcpKF7WnE/s320/perspective_window.gif" width="320" /></a></div>
<br />
Finally, Patrick Snape contributed Python bindings for <a href="http://blog.dlib.net/2015/02/dlib-1813-released.html">dlib's video tracker</a>, so now you can <a href="http://dlib.net/correlation_tracker.py.html">use it from Python</a>. To try out these new tools download the <a href="https://github.com/davisking/dlib/releases/download/v18.16/dlib-18.16.tar.bz2">newest dlib release</a>.<br />
<br />
<br />
<br />Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com9tag:blogger.com,1999:blog-6061887630060661987.post-85231594533106838412015-02-03T20:08:00.004-05:002015-03-02T06:53:16.632-05:00Python Stuff and Real-Time Video Object TrackingThe new version of <a href="http://dlib.net/">dlib</a> is out today. As promised, there is now a full Python API for using dlib's <a href="http://blog.dlib.net/2014/08/real-time-face-pose-estimation.html">state-of-the-art object pose estimation and learning</a> tools. You can see examples of this API <a href="http://dlib.net/face_landmark_detection.py.html">here</a> and <a href="http://dlib.net/train_shape_predictor.py.html">here</a>. Thank Patrick Snape, one of the main developers of the <a href="http://www.menpo.org/">menpo</a> project, for this addition.<br />
<br />
Also, I've added an implementation of the winning algorithm from last year's <a href="http://www.votchallenge.net/vot2014/">Visual Object Tracking Challenge</a>. It's the method described in the paper:<br />
<blockquote class="tr_bq">
<span style="background-color: white;">Danelljan, Martin, et al. "Accurate scale estimation for robust visual tracking." Proceedings of the British Machine Vision Conference BMVC. 2014.</span></blockquote>
You can see some videos showing dlib's implementation of this new tracker in action on youtube:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/-8-KCoOFfqs/0.jpg" frameborder="0" height="266" src="http://www.youtube.com/embed/-8-KCoOFfqs?feature=player_embedded" width="320"></iframe></div>
<br />
All these videos were processed by exactly the same piece of software. No hand tweaking or any funny business. The only required input (other than the raw video) is a bounding box on the first frame and then the tracker automatically follows whatever is inside the box after that. The whole thing runs at over 150fps on my desktop. You can see an example program showing how to use it <a href="http://dlib.net/video_tracking_ex.cpp.html">here</a>, or just go <a href="http://sourceforge.net/projects/dclib/files/latest/download">download the new dlib instead</a> :)<br />
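For reference, here is the gist of the tracker API, condensed from that example program (the file names and the starting box coordinates are placeholders):<br />
<pre class="code_box">#include <dlib/image_processing.h>
#include <dlib/image_io.h>
using namespace dlib;

int main()
{
    correlation_tracker tracker;

    matrix<unsigned char> frame;
    load_image(frame, "frame0.jpg");
    // The only required input: a box around the object on the first frame.
    tracker.start_track(frame, centered_rect(point(93,110), 38, 86));

    // Then just feed it the following frames one at a time.
    load_image(frame, "frame1.jpg");
    tracker.update(frame);
    drectangle pos = tracker.get_position();  // where the object is now
}</pre>
<br />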
<br />
I've also finally posted the paper I've been writing on dlib's <a href="http://arxiv.org/abs/1502.00046">structural SVM based training algorithm</a>, which is the algorithm behind the <a href="http://blog.dlib.net/2014/02/dlib-186-released-make-your-own-object.html">easy to use object detector</a>.Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com88tag:blogger.com,1999:blog-6061887630060661987.post-66129904732117853432014-12-20T16:34:00.001-05:002014-12-20T16:34:39.737-05:00Dlib 18.12 releasedI just released the next version of <a href="http://dlib.net/">dlib</a>. This time I added tools for computing 2D FFTs, Hough transforms, image skeletonizations, and also a simple and type safe API for calling C++ code from MATLAB. Readers familiar with writing MATLAB mex functions know how much of a pain it is, but no longer! Here is an example of a C++ function callable from MATLAB using <a href="http://dlib.net/dlib/matlab/example_mex_function.cpp.html">dlib's new MATLAB binding API</a>. You can also compile it with CMake so building it is super easy. There is an example CMake file in the dlib/matlab folder showing how to set it up. I also used this tool to give the <a href="http://blog.dlib.net/2014/04/mitie-completely-free-and-state-of-art.html">MITIE project</a> a simple MATLAB API. So you can see another example of how easy it is to set this up in the <a href="https://github.com/mit-nlp/MITIE/tree/master/examples/matlab">MITIE MATLAB example</a>.<br />
<br />
There are also some fun new things in the pipe for the next dlib release (v18.13). First, Patrick Snape, one of the main developers of the <a href="http://www.menpo.io/">menpo</a> project, is adding a Python interface to dlib's <a href="http://blog.dlib.net/2014/08/real-time-face-pose-estimation.html">shape prediction</a> tools. You can follow that over on <a href="https://github.com/davisking/dlib/pull/2">dlib's github repo</a>. I'm also working on a single object tracker for <a href="http://code.opencv.org/projects/opencv/wiki/VisionChallenge">OpenCV's Vision Challenge</a> which I plan to include in the next version of dlib.Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com17tag:blogger.com,1999:blog-6061887630060661987.post-34319112543811145422014-11-15T10:32:00.001-05:002014-11-16T08:38:23.662-05:00Dlib 18.11 releasedThe new version of <a href="http://dlib.net/">dlib</a> is out. This release contains mostly minor bug fixes and usability improvements, with the notable exception of new routines for extracting local-binary-pattern features from images and improved tools for learning distance metrics. See the <a href="http://dlib.net/release_notes.html">release notes</a> for further information.<br />
<br />
I also recently found out about two particularly interesting projects that use dlib. The first is <a href="http://www.menpo.io/">menpo</a>, a Python library focused on computer vision which is being developed by a team at Imperial College London. If you are interested in a Python library that pulls together a bunch of computer vision tools then definitely check it out. The other interesting project is <a href="http://www.ceemple.com/">Ceemple</a>, which is basically an interactive language shell for C++. They have integrated a bunch of libraries like dlib and OpenCV into it with the general goal of making C++ development feel more rapid and interactive. So think of something like MATLAB or IPython, but for C++.Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com1tag:blogger.com,1999:blog-6061887630060661987.post-91076261705721029102014-10-21T10:41:00.001-04:002014-10-24T06:17:04.399-04:00MITIE v0.3 Released: Now with Java and R APIsWe just made the next release of MITIE, a new <a href="https://github.com/mit-nlp/MITIE">DARPA funded information extraction tool</a> being created by our team at MIT. This release is relatively minor and just adds APIs for Java and R. The project page on <a href="https://github.com/mit-nlp/MITIE">github</a> explains how to get started using either of these APIs.<br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">I want to take some time and explain how the Java API is implemented since, as I discovered while making MITIE's Java API, there aren't clear instructions for doing this anywhere on the internet. So hopefully this little tutorial will help you if you decide to make a similar Java binding to a C++ library. So to begin, let's think about the requirements for a good Java binding:</span><br />
<ul>
<li><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">You should be able to compile it from source with a simple command</span></li>
<li><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">A user of your library should not need to edit or configure anything to compile the API</span></li>
<li><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">The compilation process should work on any platform</span></li>
<li><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">Writing JNI is awful so you shouldn't have to do that</span></li>
</ul>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">This pretty much leads you to <a href="http://www.swig.org/">Swig </a>and <a href="http://www.cmake.org/">CMake</a> which are both great tools. However, finding out how to get CMake to work with Swig was painful and is pretty much what this blog post is about. Happily, it's possible to do and results in a very clean and easy to use mechanism for creating Java APIs. In particular, you can compile <a href="https://github.com/mit-nlp/MITIE/tree/master/mitielib/java">MITIE's Swig/CMake based Java API</a> using the usual CMake commands:</span><br />
<pre class="code_box">mkdir build
cd build
cmake ..
cmake --build . --config Release --target install</pre>
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">That creates a jar file and shared library file which together form the MITIE Java API. Let's run through a little example to see how you can define new Java APIs. Imagine you have created a simple C++ API that looks like this:</span></div>
<div>
<pre class="code_box">void printSomeString (const std::string& message);
class MyClass {
public:
std::vector<std::string> getSomeStrings() const;
};</pre>
</div>
<div>
</div>
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">and you want to be able to use it from Java. You just need to put this C++ API in a header file called swig_api.h and include some Swig commands that tell Swig what to call std::vector<std::string> in the generated Java API. So the contents of swig_api.h would look like:</span></div>
<div>
<pre class="code_box">// Define some swig type maps that tell swig what to call various instantiations of
// std::vector.
#ifdef SWIG
%include "std_string.i"
%include "std_vector.i"
%template(StringVector) std::vector<std::string>;
#endif
#include <string>
#include <vector>
void printSomeString (const std::string& message);
class MyClass {
public:
std::vector<std::string> getSomeStrings() const;
};
</pre>
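<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">For completeness, here is a minimal sketch of what a matching implementation file might contain (this would be the my_source.cpp listed in the CMake file below; the strings it returns are just for illustration):</span><br />
<pre class="code_box">#include "swig_api.h"
#include <iostream>

void printSomeString (const std::string& message)
{
    std::cout << message << std::endl;
}

std::vector<std::string> MyClass::getSomeStrings() const
{
    // Return some placeholder strings so the Java example below has
    // something to print.
    std::vector<std::string> strings;
    strings.push_back("one");
    strings.push_back("two");
    strings.push_back("three");
    return strings;
}
</pre>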
<span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">The next step is to create a CMakeLists.txt file that tells CMake how to compile your API. In our case, it would look like:</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">
</span>
<br />
<pre class="code_box">cmake_minimum_required (VERSION 2.8.4)
project(example)
set(java_package_name edu.mit.ll.example)
# List the source files you want to compile into the Java API. These contain
# things like implementations of printSomeString() and whatever else you need.
set(source_files my_source.cpp another_source_file.cpp )
# List the folders that contain your header files
include_directories( . )
# List of libraries to link to. For example, you might need to link to pthread
set(additional_link_libraries pthread)
# Tell CMake to put the compiled shared library and example.jar file into the
# same folder as this CMakeLists.txt file when the --target install option is
# executed. You can put any folder here, just give a path that is relative to
# the CMakeLists.txt file.
set(install_target_output_folder .)
include(cmake_swig_jni)</pre>
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">That's it. Now you can compile your Java API using CMake, and you will get an example.jar and an example.dll or libexample.so file, depending on your platform. Then to use it you can write Java code like this:</span></div>
</div>
<div>
<pre class="code_box">import edu.mit.ll.example.*;
public class Example {
public static void main(String args[]) {
global.printSomeString("hello world!");
MyClass obj = new MyClass();
StringVector temp = obj.getSomeStrings();
for (int i = 0; i < temp.size(); ++i)
System.out.println(temp.get(i));
}
}</pre>
</div>
<div>
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">and execute it via:</span></div>
<pre class="code_box"><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">javac -classpath example.jar Example.java
java -classpath example.jar;. -Djava.library.path=. Example</span></pre>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">
</span>
<br />
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">assuming the example.jar and shared library are in your current folder. Note that Linux or OS X users will need to use a : as the classpath separator rather than the ; required on Windows. But that's it! You just made a Java interface to your C++ library. You might have noticed the include(cmake_swig_jni) statement though. That is a bunch of CMake magic I had to write to make all this work, but work it does and on different platforms without trouble. You can see a larger example of a Java to C++ binding in <a href="https://github.com/mit-nlp/MITIE/tree/master/mitielib/java">MITIE's github repo</a> using this same setup.</span></div>
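<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">To be concrete about the classpath note above, on Linux or OS X you would run the example like this:</span><br />
<pre class="code_box">javac -classpath example.jar Example.java
java -classpath example.jar:. -Djava.library.path=. Example</pre>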
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">
</span>
<br />
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><br /></span></div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">
</span></div>
Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com12tag:blogger.com,1999:blog-6061887630060661987.post-41590871079198118972014-08-28T22:23:00.000-04:002014-09-03T18:17:03.721-04:00Real-Time Face Pose EstimationI just posted the next version of dlib, <a href="http://sourceforge.net/projects/dclib/files/dlib/v18.10/dlib-18.10.tar.bz2">v18.10</a>, and it includes a number of new minor features. The main addition in this release is an implementation of an excellent paper from this year's Computer Vision and Pattern Recognition Conference:<br />
<blockquote class="tr_bq">
<span style="background-color: white;">One Millisecond Face Alignment with an Ensemble of Regression Trees by Vahid Kazemi and Josephine Sullivan</span></blockquote>
As the name suggests, it allows you to perform face pose estimation very quickly. In particular, this means that if you give it an image of someone's face it will add this kind of annotation:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioqEpgrNl36xhm7cllEZ-kE5xF-m806wV7TX9RvAwO53zEHeMLNBSuzgCnA8iyFA7OQp662HnqAuP_YvkJBQmJV59WIBeuS-rAxMjYofGkuocZtWXbtLGQ3lPxOuzJrY9lgbIVheMmtNw/s1600/landmarked_face2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioqEpgrNl36xhm7cllEZ-kE5xF-m806wV7TX9RvAwO53zEHeMLNBSuzgCnA8iyFA7OQp662HnqAuP_YvkJBQmJV59WIBeuS-rAxMjYofGkuocZtWXbtLGQ3lPxOuzJrY9lgbIVheMmtNw/s1600/landmarked_face2.png" height="396" width="400" /></a></div>
<br />
In fact, this is the output of dlib's new <a href="http://dlib.net/face_landmark_detection_ex.cpp.html">face landmarking example program</a> on one of the images from the HELEN dataset. To get an even better idea of how well this pose estimator works, take a look at this video where it has been applied to each frame:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/oWviqXI9xmI?feature=player_embedded' frameborder='0'></iframe></div>
<br />
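<div>
If you want to call the new pose estimator from your own C++ code, the gist of the linked example program looks roughly like this. This is a condensed sketch, not the full example; it assumes you have the shape_predictor_68_face_landmarks.dat model file that the example uses and an image called face.jpg:</div>
<pre class="code_box">#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/image_processing.h>
#include <dlib/image_io.h>
#include <iostream>

using namespace dlib;

int main()
{
    // Load the face detector and the pose estimation model.
    frontal_face_detector detector = get_frontal_face_detector();
    shape_predictor sp;
    deserialize("shape_predictor_68_face_landmarks.dat") >> sp;

    array2d<rgb_pixel> img;
    load_image(img, "face.jpg");

    // Find the faces, then the landmarks on each face.
    std::vector<rectangle> dets = detector(img);
    for (unsigned long i = 0; i < dets.size(); ++i)
    {
        full_object_detection shape = sp(img, dets[i]);
        std::cout << "number of landmarks: " << shape.num_parts() << std::endl;
        std::cout << "position of first landmark: " << shape.part(0) << std::endl;
    }
}
</pre>
<br />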
It doesn't just stop there though. You can use this technique to make your own custom pose estimation models. To see how, take a look at the <a href="http://dlib.net/train_shape_predictor_ex.cpp.html">example program</a> for training these pose estimation models.Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com330tag:blogger.com,1999:blog-6061887630060661987.post-30949601106709248482014-07-10T16:02:00.000-04:002014-07-10T16:14:10.382-04:00MITIE v0.2 Released: Now includes Python and C++ APIs for named entity recognition and binary relation extractionA few months ago I <a href="http://blog.dlib.net/2014/04/mitie-completely-free-and-state-of-art.html">posted about MITIE</a>, the new <a href="https://github.com/mit-nlp/MITIE">DARPA funded information extraction tool</a> being created by our team at MIT. At the time it only provided English named entity recognition and sported a simple C API. Since then we have been busy adding new features and today we released a new version of MITIE which adds a bunch of nice things, including:<br />
<ul>
<li>Python and C++ APIs</li>
<li>Many <a href="https://github.com/mit-nlp/MITIE/tree/master/examples">example programs</a></li>
<li>21 English binary relation extractors which identify pairs of entities with certain relations. E.g. "PERSON BORN_IN PLACE"</li>
<li>Python, C, and C++ APIs for training your own named entity and binary relation extractors</li>
</ul>
<div>
You can get MITIE from its <a href="https://github.com/mit-nlp/MITIE">github</a> page. Then you can try out some of the new features in v0.2, one of which is binary relation extraction. This means you can ask MITIE if two entities participate in some known relationship. For example, you can ask if a piece of text is making the claim that a person was born in a location. That is, are the person and location entities participating in the "born in" relationship? <br />
<br />
In particular, you could run MITIE over all the Wikipedia articles that mention Barack Obama and find each instance where someone made the claim that Barack Obama was born in some place. I did this with MITIE and found the following:<br />
<br />
<ul>
<li>14 claims that Barack Obama was born in Hawaii</li>
<li>5 claims that Barack Obama was born in the United States</li>
<li>3 claims that Barack Obama was born in Kenya</li>
</ul>
<div>
<br /></div>
<div>
Which is humorous. One of them is the sentence:</div>
<blockquote class="tr_bq">
You can still find sources of that type which still assert that "Barack Obama was born in Kenya"</blockquote>
<div>
When you read it in the broader context of the article it's clear that it's not claiming he was born in Kenya. So this is a good example of why it's important to aggregate over many relation instances when using a relation extractor. By aggregating many examples we can get reasonably accurate outputs in the face of these kinds of mistakes. </div>
<div>
<br /></div>
<div>
However, what is even more entertaining than poking fun at American political dysfunction is MITIE's new API for creating your own entity and relation extractors. We worked to make this very easy to use. In particular, there are no parameters you need to mess with; everything is dealt with internally by MITIE. All you, the user, need to do is give example data showing what you want MITIE to learn to detect, and it takes care of the rest. Moreover, in the spirit of easy-to-use APIs, we also added a new Python API that allows you to exercise all the functionality in MITIE via Python. As a little example, here is how you use it to find named entities:</div>
</div>
<pre class="code_box">from mitie import *
ner = named_entity_extractor('MITIE-models/english/ner_model.dat')
tokens = tokenize("The MIT Information Extraction (MITIE) tool was created \
by Davis King, Michael Yee, and Wade Shen at the \
Massachusetts Institute of Technology.")
print tokens</pre>
<div>
This loads in the English named entity recognizer model that comes with MITIE and then tokenizes the sentence. So the print statement produces </div>
<div>
<blockquote class="tr_bq">
['The', 'MIT', 'Information', 'Extraction', '(', 'MITIE', ')', 'tool', 'was', 'created', 'by', 'Davis', 'King', ',', 'Michael', 'Yee', ',', 'and', 'Wade', 'Shen', 'at', 'the', 'Massachusetts', 'Institute', 'of', 'Technology', '.']</blockquote>
</div>
<div>
Then to find the named entities we simply do</div>
<pre class="code_box">entities = ner.extract_entities(tokens)
print "Number of entities detected:", len(entities)
print "Entities found:", entities</pre>
<div>
Which prints:<br />
<blockquote class="tr_bq">
Number of entities detected: 6<br />
Entities found: [(xrange(1, 4), 'ORGANIZATION'), (xrange(5, 6), 'ORGANIZATION'), (xrange(11, 13), 'PERSON'), (xrange(14, 16), 'PERSON'), (xrange(18, 20), 'PERSON'), (xrange(22, 26), 'ORGANIZATION')]</blockquote>
</div>
<div>
So the output is just a list of ranges and labels. Each range indicates which tokens are part of that entity. To print these out in a nice list we would just do</div>
<pre class="code_box">for e in entities:
range = e[0]
tag = e[1]
entity_text = " ".join(tokens[i] for i in range)
print tag + ": " + entity_text</pre>
<div>
Which prints:<br />
<blockquote class="tr_bq">
ORGANIZATION: MIT Information Extraction<br />
ORGANIZATION: MITIE<br />
PERSON: Davis King<br />
PERSON: Michael Yee<br />
PERSON: Wade Shen<br />
ORGANIZATION: Massachusetts Institute of Technology</blockquote>
</div>
<div>
<a href="https://github.com/mit-nlp/MITIE">Now go give the new MITIE a try</a>!</div>
Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com12tag:blogger.com,1999:blog-6061887630060661987.post-17987350306677233412014-04-09T21:54:00.002-04:002014-04-10T06:43:12.142-04:00Dlib 18.7 released: Make your own object detector in Python!A while ago I <a href="http://blog.dlib.net/2014/02/dlib-186-released-make-your-own-object.html">boasted</a> about how dlib's object detection tools are better than OpenCV's. However, one thing OpenCV had on dlib was a nice Python API, but no longer! The new version of dlib is out and it includes a Python API for using and creating object detectors. What does this API look like? Well, let's start by imagining you want to detect faces in this image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy3W21GVtYHebnbmmluBQVFc9wBWL5M0Z2u93qVp1cQwAjCU4UANQk2jc5xZ3bqwNcRROy1q-rUckhD4e7vWXi2YQxPxlXnEr_90-Q70LEVknUh1gNOK4kNf-xn7kgYFFbIJw2KKDGsjc/s1600/WHITEHOUSEBEER.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy3W21GVtYHebnbmmluBQVFc9wBWL5M0Z2u93qVp1cQwAjCU4UANQk2jc5xZ3bqwNcRROy1q-rUckhD4e7vWXi2YQxPxlXnEr_90-Q70LEVknUh1gNOK4kNf-xn7kgYFFbIJw2KKDGsjc/s1600/WHITEHOUSEBEER.jpg" height="290" width="400" /></a></div>
<br />
You would begin by importing dlib and <a href="http://scikit-image.org/">scikit-image</a>:<br />
<pre class="code_box">import dlib
from skimage import io</pre>
Then you load dlib's default face detector, the image of Obama, and then invoke the detector on the image:
<br />
<pre class="code_box">detector = dlib.get_frontal_face_detector()
img = io.imread('obama.jpg')
faces = detector(img)</pre>
The result is an array of boxes called faces. Each box gives the pixel coordinates that bound each detected face. To get these coordinates out of faces you do something like:
<br />
<pre class="code_box">for d in faces:
print "left,top,right,bottom:", d.left(), d.top(), d.right(), d.bottom()
</pre>
We can also view the results graphically by running:
<br />
<pre class="code_box">win = dlib.image_window()
win.set_image(img)
win.add_overlay(faces)
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfAz4oakaZUkXwnc3KFZ6BkE5FN7W4uwv7TmgkrAqxiJG9aKoCcnOyUUzIn15v109ckGfOj1NdjNH_Xza0b9PcVBgc3upPtneWrDwMZdlSN1w9E6DGiwyzv93sUoO8rzkCFkNIwzsQo-s/s1600/obama.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfAz4oakaZUkXwnc3KFZ6BkE5FN7W4uwv7TmgkrAqxiJG9aKoCcnOyUUzIn15v109ckGfOj1NdjNH_Xza0b9PcVBgc3upPtneWrDwMZdlSN1w9E6DGiwyzv93sUoO8rzkCFkNIwzsQo-s/s1600/obama.jpg" height="290" width="400" /></a></div>
<br />
But what if you wanted to create your own object detector? That's easy too. Dlib comes with an <a href="http://dlib.net/train_object_detector.py.html">example program</a> and a sample training dataset showing how to do this. But to summarize, you do:<br />
<pre class="code_box">options = dlib.simple_object_detector_training_options()
options.C = 5 # Set the SVM C parameter to 5.
dlib.train_simple_object_detector("training.xml","detector.svm", options)
</pre>
That will run the trainer and save the learned detector to a file called detector.svm. The training data is read from training.xml, which contains a list of images and bounding boxes. The example that comes with dlib shows the format of the XML file, and there is also a graphical tool included that lets you mark up images with a mouse and save these XML files.
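To give a rough idea of the format, a tiny training.xml might look something like this (the file name and box coordinates below are made up; see the sample dataset that comes with dlib for real files):
<pre class="code_box"><?xml version='1.0' encoding='ISO-8859-1'?>
<dataset>
<name>my training dataset</name>
<images>
  <image file='images/img1.jpg'>
    <box top='74' left='158' width='176' height='182'/>
  </image>
</images>
</dataset></pre>
Finally, to load your custom detector you do: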
<pre class="code_box">detector = dlib.simple_object_detector("detector.svm")
</pre>
If you want to try it out yourself you can download the <a href="https://sourceforge.net/project/platformdownload.php?group_id=130373">new dlib release here</a>.Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com60tag:blogger.com,1999:blog-6061887630060661987.post-84778516489464158302014-04-03T22:45:00.001-04:002014-07-10T16:10:19.197-04:00MITIE: A completely free and state-of-the-art information extraction toolI work at an <a href="http://web.mit.edu/">MIT</a> lab and there are a lot of cool things about my job. In fact, I could go on all day about it, but in this post I want to talk about one thing in particular, which is that we recently got funded by the <a href="http://www.darpa.mil/OpenCatalog/XDATA.html">DARPA XDATA</a> program to make an open source natural language processing library focused on information extraction. <br />
<br />
Why make such a thing when there are already open source libraries out there for this (e.g. OpenNLP, NLTK, Stanford IE, etc.)? Well, if you look around you quickly find out that everything which exists is either expensive, not state-of-the-art, or GPL licensed. If you wanted to use this kind of NLP tool in a non-GPL project then you are either out of luck, have to pay a lot of money, or settle for something of low quality. Well, not anymore! We just released the first version of our <a href="https://github.com/mit-nlp/MITIE">MIT Information Extraction library</a> which is built using <a href="https://github.com/mit-nlp/MITIE/wiki/Evaluation">state-of-the-art</a> statistical machine learning tools. <br />
<br />
At this point it has just a <a href="https://github.com/mit-nlp/MITIE/blob/master/mitielib/include/mitie.h">C API</a> and an <a href="https://github.com/mit-nlp/MITIE/blob/master/examples/C/ner/ner_example.c">example program</a> showing how to do English named entity recognition. Over the next few weeks we will be adding bindings for other languages like Python and Java. We will also be adding a lot more NLP tools in addition to named entity recognition, starting with relation extractors and part of speech taggers. But in the meantime you can use the C API or the streaming command line program. For example, if you had the following text in a file called sample_text.txt:<br />
<blockquote class="tr_bq">
Meredith Vieira will become the first woman to host Olympics primetime coverage on her own when she fills on Friday night for the ailing Bob Costas, who is battling a continuing eye infection. </blockquote>
Then you can simply run:<br />
<pre class="code_box">cat sample_text.txt | ./ner_stream MITIE-models/ner_model.dat
</pre>
And you get this as output:<br />
<blockquote class="tr_bq">
[PERSON Meredith Vieira] will become the first woman to host [MISC Olympics] primetime coverage on her own when she fills on Friday night for the ailing [PERSON Bob Costas] , who is battling a continuing eye infection .
</blockquote>
<div>
It's all up on <a href="https://github.com/mit-nlp/MITIE">github</a> so if you want to try it out yourself then just run these commands and off you go:</div>
<pre class="code_box">git clone https://github.com/mit-nlp/MITIE.git
cd MITIE
./fetch_submodules.sh
make examples
make MITIE-models
cat sample_text.txt | ./ner_stream MITIE-models/ner_model.dat</pre>
Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com23tag:blogger.com,1999:blog-6061887630060661987.post-66514595539033474522014-02-03T22:55:00.000-05:002015-02-04T03:24:52.192-05:00Dlib 18.6 released: Make your own object detector!<div class="separator" style="clear: both; text-align: left;">
I just posted the next version of dlib, v18.6. There are a bunch of nice changes, but the most exciting addition is a tool for creating histogram-of-oriented-gradient (HOG) based object detectors. This is a technique for detecting semi-rigid objects in images which has become a classic computer vision method since its publication in 2005. In fact, the original HOG paper has been cited <a href="http://scholar.google.com/scholar?q=Histograms+of+Oriented+Gradients+for+Human+Detection+by+Navneet+Dalal+and+Bill+Triggs">over 7000 times</a>, which, for those of you who don't follow the academic literature, is a whole lot.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
But back to dlib, the new release has a tool that makes training HOG detectors super fast and easy. For instance, here is an example program that shows <a href="http://dlib.net/fhog_object_detector_ex.cpp.html">how to train a human face detector</a>. All it needs as input is a set of images and bounding boxes around faces. On my computer it takes about 6 seconds to do its training using the example face data provided with dlib. Once finished it produces a HOG detector capable of detecting faces. An example of the detector's output on a new image (i.e. one it wasn't trained on) is shown below:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaq6urV3lcpNGJyQs31I5zk-TVFrKvPhwgaB4j5Y39-JMH3xC2jJxv2F_AtDJI7f1Ua_uz5HojKUzdL5xkqgL6t-ZbxYn_eHpz8cYZFXF2LljPt5bzHJEcTMXrNoJ9GruA0lJY7GgGsUM/s1600/face_detections.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaq6urV3lcpNGJyQs31I5zk-TVFrKvPhwgaB4j5Y39-JMH3xC2jJxv2F_AtDJI7f1Ua_uz5HojKUzdL5xkqgL6t-ZbxYn_eHpz8cYZFXF2LljPt5bzHJEcTMXrNoJ9GruA0lJY7GgGsUM/s1600/face_detections.jpg" height="424" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
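<div class="separator" style="clear: both; text-align: left;">
Condensed down to its essentials, the training code from that example looks roughly like this. This is a sketch rather than the full program; the dataset file name is an assumption and the 80x80 detection window is the value used in the example:</div>
<pre class="code_box">#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>
#include <fstream>

using namespace dlib;

int main()
{
    // Load the training images and the boxes bounding each face.
    dlib::array<array2d<unsigned char> > images;
    std::vector<std::vector<rectangle> > boxes;
    load_image_dataset(images, boxes, "training.xml");

    // Scan an 80x80 HOG window over each level of an image pyramid.
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;
    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);

    // Train with the structural SVM trainer. C is the usual SVM C parameter.
    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);
    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // Save the learned detector to disk.
    std::ofstream fout("face_detector.svm", std::ios::binary);
    serialize(detector, fout);
}
</pre>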
<div class="separator" style="clear: both; text-align: left;">
You should compare this to the time it takes to train OpenCV's popular cascaded haar object detector, which is generally reported to take <a href="http://stackoverflow.com/questions/19487303/how-much-time-does-opencv-haar-training-take">hours</a> or <a href="http://answers.opencv.org/question/757/haartraining-vs-traincascade-object-detection/">days</a> to train and requires you to <a href="http://note.sonots.com/SciSoftware/haartraining.html">fiddle with false negative rates</a> and all kinds of spurious parameters. HOG training is considerably simpler.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Moreover, the HOG trainer uses dlib's <a href="http://arxiv.org/abs/1502.00046">structural SVM based training algorithm</a> which enables it to train on all the sub-windows in every image. This means you don't have to perform any tedious subsampling or "hard negative mining". It also means you often don't need that much training data. In particular, the <a href="http://dlib.net/fhog_object_detector_ex.cpp.html">example program that trains a face detector</a> takes in only 4 images, containing a total of 18 faces. That is sufficient to produce the HOG detector used above. The example also shows you how to visualize the learned HOG detector, which in this case looks like:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRUuYwXYbCdNiCJeXij7c0LWqfLYKK11m62sZczNm0BQTzNMQ3FYK0O1TJ8RkzO2272B6ITUHVkvuVjovKhyphenhyphenTlLV6Xua7jI3HoX1z3oawsuBtKb0XhnQSImN1yu5GvGPVglpSU7zo0cqQ/s1600/face_fhog_filters.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRUuYwXYbCdNiCJeXij7c0LWqfLYKK11m62sZczNm0BQTzNMQ3FYK0O1TJ8RkzO2272B6ITUHVkvuVjovKhyphenhyphenTlLV6Xua7jI3HoX1z3oawsuBtKb0XhnQSImN1yu5GvGPVglpSU7zo0cqQ/s1600/face_fhog_filters.png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
It looks like a face! It should be noted that it's worth training on more than 4 images since it doesn't take that long to label and train on at least a few hundred objects, and doing so can improve the accuracy. In particular, I trained a HOG face detector using about 3000 images from the <a href="http://vis-www.cs.umass.edu/lfw/">labeled faces in the wild</a> dataset and the training took only about 3 minutes. 3000 is probably excessive, but who cares when training is so fast.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The face detector which was trained on the labeled faces in the wild data comes with the new version of dlib. You can see how to use it in this <a href="http://dlib.net/face_detection_ex.cpp.html">face detection example program</a>. The underlying detection code in dlib will make use of SSE instructions on Intel CPUs and this makes dlib's HOG detectors run at the same speed as OpenCV's fast cascaded object detectors. So for something like a 640x480 resolution web camera it's fast enough to run in real-time. As for the accuracy, it's easy to get the same detection rate as OpenCV but with <i style="font-weight: bold;">thousands of times fewer false alarms</i>. You can see an example in this youtube video which compares OpenCV's face detector to the new HOG face detector in dlib. The circles are from OpenCV's default face detector and the red squares are dlib's HOG based face detector. The difference is night and day. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div style="text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/LsK0hzcEyHI?feature=player_embedded' frameborder='0'></iframe></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
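<div class="separator" style="clear: both; text-align: left;">
For reference, running the included face detector from C++ boils down to just a few lines. Here is a condensed sketch of the linked face detection example (the image file name is made up):</div>
<pre class="code_box">#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/image_transforms.h>
#include <dlib/image_io.h>
#include <iostream>

using namespace dlib;

int main()
{
    // This is the HOG face detector trained on the labeled faces in the wild data.
    frontal_face_detector detector = get_frontal_face_detector();

    array2d<unsigned char> img;
    load_image(img, "faces.jpg");

    // Upsample the image so the detector can find smaller faces.
    pyramid_up(img);

    std::vector<rectangle> dets = detector(img);
    std::cout << "Number of faces detected: " << dets.size() << std::endl;
}
</pre>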
Finally, here is another fun example. Before making this post I downloaded 8 images of stop signs from Google images, drew bounding boxes on them and then trained a HOG detector. This is the detector I got after a few seconds of training:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfNaQvrwNLOwNP5jqgUEY8jtvjUz6ux9EGZLsU9XF1VPvCi3lc-dmeQAmqrD2D47aMrTN06BKJd0GVFZtQrEXh-MV1p3m6F1hM3P46nbL-b6ZsItn1DNJLHrGY8ZoDTCgoU6efJsUA2bs/s1600/stopsign_fhog_filter.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfNaQvrwNLOwNP5jqgUEY8jtvjUz6ux9EGZLsU9XF1VPvCi3lc-dmeQAmqrD2D47aMrTN06BKJd0GVFZtQrEXh-MV1p3m6F1hM3P46nbL-b6ZsItn1DNJLHrGY8ZoDTCgoU6efJsUA2bs/s1600/stopsign_fhog_filter.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
It looks like a stop sign and testing it on a new image works great.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQjkIV8Qm5VhRtLgnQATobVp7ARHEYX-54etkwSOTD0rLPI57Nb3MtmpKID3r1K0wiTpL6Q3CzBHR7sJX9jdXT4Rw__STDqeVwVKjCsOQ4w9kBE3CvVBZaITaM1vZSTbNbA520CR-qvGw/s1600/test_stopsign.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQjkIV8Qm5VhRtLgnQATobVp7ARHEYX-54etkwSOTD0rLPI57Nb3MtmpKID3r1K0wiTpL6Q3CzBHR7sJX9jdXT4Rw__STDqeVwVKjCsOQ4w9kBE3CvVBZaITaM1vZSTbNbA520CR-qvGw/s1600/test_stopsign.png" height="400" width="297" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
All together it took me about 5 minutes to go from not having any data at all to a working stop sign detector. Not too shabby. Go try it out yourself. You can get the <a href="https://sourceforge.net/project/platformdownload.php?group_id=130373">the new dlib release here</a> :)Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com252tag:blogger.com,1999:blog-6061887630060661987.post-9564356532419725632007-03-09T19:10:00.000-05:002013-07-04T13:29:29.899-04:00Adding a web interface to a C++ applicationOne thing that is always sort of a pain is setting up a graphical user interface. This is especially true if you are making an embedded application or something that functions more as a system service or daemon. In this case you probably end up creating some simple network protocol which you use to control your application via some other remote piece of software or just telnet if you are feeling especially lazy.<br />
<br />
What would be nice, however, is to have a web based interface but not have to jump through a bunch of hoops to add it into your code. This sort of simple add-in HTTP server is exactly what this post is all about.<br />
<br />
A few months ago I was spending Christmas with the family. Good times all around, but they live out in the middle of nowhere. And when I say nowhere I'm talking no TV, 1 bar on the cell phone during a 'clear' conversation, no internet, and some farm animals. We do have what is seemingly an ex-circus goat that walks around by balancing on just its front two feet. You can imagine how entertaining that is, but even so it only occupies you for so long. So naturally I broke out my laptop and created just the thing I have always wanted, a simple HTTP server I can stick into my C++ applications. I love Christmas :)<br />
<br />
The code for the thing is available from <a href="http://dclib.sf.net/">sourceforge</a>. There is also a more involved example program that can be viewed in its entirety <a href="http://dclib.sourceforge.net/server_http_ex.cpp.html">here</a>, but for the sake of creating a somewhat tidy little tutorial I'll show a simple example.<br />
<pre style="background: rgb(238, 238, 238);"><span style="color: blue;">#include</span> <span style="color: #5555ff;"><</span>dlib<span style="color: #5555ff;">/</span>server.h<span style="color: #5555ff;">></span>
<span style="color: blue;">using</span> <span style="color: blue;">namespace</span> dlib;
<span style="color: blue;">class</span> <b><a href="" name="web_server"></a>web_server</b> : <span style="color: blue;">public</span> server_http
<b>{</b>
<span style="color: blue;">const</span> std::string <b><a href="" name="on_request"></a>on_request</b> <span style="font-family: Lucida Console;">(</span>
<span style="color: blue;">const</span> incoming_things<span style="color: #5555ff;">&</span> incoming,
outgoing_things<span style="color: #5555ff;">&</span> outgoing
<span style="font-family: Lucida Console;">)</span>
<b>{</b>
<span style="color: blue;">return</span> "<span style="color: #cc0000;"><html><body>Hurray for simple things!</body></html></span>";
<b>}</b>
<b>}</b>;
<span style="color: blue;"><u>int</u></span> <b><a href="" name="main"></a>main</b><span style="font-family: Lucida Console;">(</span><span style="font-family: Lucida Console;">)</span>
<b>{</b>
web_server our_web_server;
our_web_server.<span style="color: #bb00bb;">set_listening_port</span><span style="font-family: Lucida Console;">(</span><span style="color: #979000;">80</span><span style="font-family: Lucida Console;">)</span>;
our_web_server.<span style="color: #bb00bb;">start</span><span style="font-family: Lucida Console;">(</span><span style="font-family: Lucida Console;">)</span>;
<b>}</b>
</pre>
Basically what is happening here, if it isn't already obvious, is we are defining a class which acts as our HTTP server. To do this all you need to do is inherit from server_http and implement the virtual function on_request(). To turn it on just set the listening port number and call start(). That's it. If you compiled this and ran it, you could check out the page it creates by pointing your browser at http://localhost/. You would see a page with the single line of text "Hurray for simple things!".<br />
<br />
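As a quick taste of what else you can do, the incoming object carries things like the requested path and any query string parameters. So a handler that echoes the query string back at you would look roughly like this. This is a sketch based on my reading of the more involved example program linked above; see it for the full set of available fields:<br />
<pre class="code_box">#include <dlib/server.h>
#include <sstream>
using namespace dlib;

class web_server : public server_http
{
    const std::string on_request (
        const incoming_things& incoming,
        outgoing_things& outgoing
    )
    {
        std::ostringstream sout;
        sout << "<html><body>";
        sout << "You asked for: " << incoming.path << "<br/>";
        // Echo back any query string parameters, e.g. http://localhost/?user=davis
        for (key_value_map::const_iterator i = incoming.queries.begin();
             i != incoming.queries.end(); ++i)
            sout << i->first << " = " << i->second << "<br/>";
        sout << "</body></html>";
        return sout.str();
    }
};
</pre>
<br />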
For an explanation of the arguments to the on_request() function, check out the example program linked above and the documentation on the dlib web site.Davis Kinghttp://www.blogger.com/profile/16577392965630448489noreply@blogger.com0