## Tuesday, October 11, 2016

### Easily Create High Quality Object Detectors with Deep Learning

A few years ago I added an implementation of the max-margin object-detection algorithm (MMOD) to dlib. This tool has since become quite popular as it frees the user from tedious tasks like hard negative mining.  You simply label things in images and it learns to detect them.  It also produces high quality detectors from relatively small amounts of training data.  For instance, one of dlib's example programs shows MMOD learning a serviceable face detector from only 4 images.

However, the MMOD implementation in dlib used HOG feature extraction followed by a single linear filter. This means it's incapable of learning to detect objects that exhibit complex pose variation or have a lot of other variability in how they appear.  To get around this, users typically train multiple detectors, one for each pose.  That works OK in many cases but isn't a really good general solution. Fortunately, over the last few years convolutional neural networks have proven themselves to be capable of dealing with all these issues within a single model.

So the obvious thing to do was to add an implementation of MMOD with the HOG feature extraction replaced with a convolutional neural network.  The new version of dlib, v19.2, contains just such a thing.  On this page you can see a short tutorial showing how to train a convolutional neural network using the MMOD loss function.  It uses dlib's new deep learning API to train the detector end-to-end on the very same 4 image dataset used in the HOG version of the example program.  Happily, and very much to the surprise of myself and my colleagues, it learns a working face detector from this tiny dataset. Here is the detector run over an image not in the training data:

I expected the CNN version of MMOD to inherit the low training data requirements of the HOG version of MMOD, but working with only 4 training images is very surprising considering other deep learning methods typically require many thousands of images to produce any kind of sensible results.

The detector is also reasonably fast for a CNN.  On the CPU, it takes about 370ms to process a 640x480 image.  On my NVIDIA Titan X GPU (the Maxwell version, not the newer Pascal version) it takes 45ms to process an image when images are processed one at a time.  If I group the images into batches then it takes about 18ms per image.

To really test the new CNN version of MMOD, I ran it through the leading face detection benchmark, FDDB.  This benchmark has two modes, 10-fold cross-validation and unrestricted.  Both test on the same dataset, but in the 10-fold cross-validation mode you are only allowed to train on data in the FDDB dataset.  In the unrestricted mode you can train on any data you like so long as it doesn't include images from FDDB.  I ran the 10-fold cross-validation version of the FDDB challenge.  This means I trained 10 CNN face detectors, each on 9 folds and tested on the held out 10th.  I did not perform any hyper parameter tuning.  Then I ran the results through the FDDB evaluation software and got this plot:

The X axis is the number of false alarms produced over the entire 2845 image dataset.  The Y axis is recall, i.e. the fraction of faces found by the detector. The green curve is the new dlib detector, which in this mode only gets about 4600 faces to train on. The red curve is the old Viola Jones detector which is still popular (although it shouldn't be, obviously). Most interestingly, the blue curve is a state-of-the-art result from the paper Face Detection with the Faster R-CNN, published only 4 months ago. In that paper, they train their detector on the very large WIDER dataset, which consists of 159,424 faces, and arguably get worse results on FDDB than the dlib detector trained on only 4600 faces.

As another test, I created the dog hipsterizer, which I made a post about a few days ago.  The hipsterizer used the exact same code and parameter settings to train a dog head detector.  The only difference was the training data consisted in 9240 dog heads instead of human faces.  That produced the very high quality models used in the hipsterizer.  So now we can automatically create fantastic images such as this one :)

 Barkhaus dogs looking fancy

As one last test of the new CNN MMOD tool I made a dataset of 6975 faces.  This dataset is a collection of face images selected from many publicly available datasets (excluding the FDDB dataset).  In particular, there are images from ImageNet, AFLW, Pascal VOC, the VGG dataset, WIDER, and face scrub.  Unlike FDDB, this new dataset contains faces in a wide range of poses rather than consisting of mostly front facing shots.  To give you an idea of what it looks like, here are all the faces in the dataset tightly cropped and tiled into one big image:

Using the new dlib tooling I trained a CNN on this dataset using the same exact code and parameter settings as used by the dog hipsterizer and previous FDDB experiment. If you want to run that CNN on your own images you can use this example program. I tested this CNN on FDDB's unrestricted protocol and found that it has a recall of 0.879134, which is quite good. However, it produced 90 false alarms.  Which sounds bad, until you look at them and find that it's finding labeling errors in FDDB.  The following image shows all the "false alarms" it outputs on FDDB.  All but one of them are actually faces.

Finally, to give you a more visceral idea of the difference in capability between the new CNN detector and the old HOG detector, here are a few images where I ran dlib's default HOG face detector (which is actually 5 HOG models) and the new CNN face detector. The red boxes are CNN detections and blue boxes are from the older HOG detector. While the HOG detector does an excellent job on easy faces looking at the camera, you can see that the CNN is way better at handling not just the easy cases but all faces in general.  And yes, I ran the HOG detector on all the images, it's just that it fails to find any faces in some of them.

## Friday, October 7, 2016

### Hipsterize Your Dog With Deep Learning

I'm getting ready to make the next dlib release, which should be out in a few days, and I thought I would point out a humorous new example program.  The dog hipsterizer!

It uses dlib's new deep learning tools to detect dogs looking at the camera. Then it uses the dlib shape predictor to identify the positions of the eyes, nose, and top of the head. From there it's trivial to make your dog hip with glasses and a mustache :)

This is what you get when you run the dog hipsterizer on this awesome image:
 Barkhaus dogs looking fancy

## Saturday, August 13, 2016

### Dlib 19.1 Released

cuDNN 5.1 is out and it isn't completely backwards compatible with cuDNN 5.0 due to a bug in cuDNN 5.1.  For the curious, in cuDNN 5.1 cudnnGetConvolutionBackwardFilterAlgorithm() will select the winograd algorithm even when the conv descriptor has a stride not equal to 1, which is an error according to the cuDNN documentation.  If you then try to run the winograd algorithm, which is what cudnnGetConvolutionBackwardFilterAlgorithm() says to do, it leads to the wrong outputs and things don't work.  Fortunately, this was detected by dlib's unit tests :)

Therefore, dlib has been updated to work with cuDNN 5.1 and hence we have a dlib 19.1 release, which you can download from dlib's home page.

## Sunday, June 26, 2016

### A Clean C++11 Deep Learning API

Dlib 19.0 is out and it has a lot of new features, like new elastic net and quadratic program solvers. But the feature I'm most excited about is the new deep learning API. There are a lot of existing deep learning frameworks, but none of them have clean C++ APIs. You have to use them through a language like Python or Lua, which is fine in and of itself. But if you are a professional software engineer working on embedded computer vision projects you are probably working in C++, and using those tools in these kinds of applications can be frustrating.

So if you use C++ to do computer vision work then dlib's deep learning framework is for you. It makes heavy use of C++11 features, allowing it to expose a very clean and lightweight API. For example, the venerable LeNet can be defined in pure C++ with a using statement:

 LeNet

    using LeNet = loss_multiclass_log<
fc<10,
relu<fc<84,
relu<fc<120,
max_pool<2,2,2,2,relu<con<16,5,5,1,1,
max_pool<2,2,2,2,relu<con<6,5,5,1,1,
input<matrix<unsigned char>>>>>>>>>>>>>>;

Then, using it to train and test a neural network looks like this:

    LeNet net;
dnn_trainer<LeNet> trainer(net);
trainer.set_learning_rate(0.01);
trainer.set_min_learning_rate(0.00001);
trainer.set_mini_batch_size(128);
trainer.train(training_images, training_labels);
// Ask the net to predict labels for all the testing images
auto predicted_labels = net(testing_images);

Dlib will even automatically switch to lower learning rates when the training error stops improving, so you won't have to fiddle with learning rate schedules. The API will certainly let you do so if you want that control. But I've been able to train a number of state-of-the-art ImageNet models without any manual fiddling of learning rates, which I find to be very convenient.

Depending on how you compile dlib, it will use either the CPU or cuDNN v5. It also supports using multiple GPUs during training and has a "fast mode" and a "low VRAM" mode. Compared to Caffe, dlib's fast mode is about 1.6x times faster than Caffe but uses about 1.5x as much VRAM, while the low VRAM mode is about 0.85x the speed of Caffe but uses half the VRAM as Caffe. So dlib's new deep learning API is fast but can also let you run larger models in the same amount of VRAM if you are VRAM constrained.

It's also fully documented. The basics are covered in this tutorial and then more advanced concepts are covered in a follow on tutorial. These tutorials show how to define LeNet and ResNet architectures in dlib and another tutorial shows how to define Inception networks. And even more importantly, every function and class in the API is documented in the reference material. Moreover, if you want to define your own computational layersloss layers, input layers, or solvers, you can because the interfaces you have to implement are fully documented.

I've also included a pretrained ResNet34A model and this example shows how to use it to classify images. This pretrained model has a top5 error of 7.572% on the 2012 imagenet validation dataset, which is slightly better than the results reported in the original paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun. Training this model took about two weeks while running on a single Titan X GPU.

To use the new deep learning tools, all you need to install is cuDNN v5.  Then you can compile the dlib example programs using the normal CMake commands.  There are no other dependencies. In fact, if you don't install cuDNN CMake will automatically configure dlib to use only the CPU and the examples will still run (but much slower).  You will however need a C++11 compiler, which precludes current versions of visual studio since they shamefully still lack full C++11 support.  But any mildly recent version of GCC will work.  Also, you can use visual studio with the non-DNN parts of dlib as they don't require C++11 support.

Finally, development of this new deep learning toolkit was sponsored by Systems & Technology Research, as part of the IARPA JANUS project. Without their support and feedback it wouldn't be nearly as polished and flexible. Jeffrey Byrne in particular was instrumental in finding bugs and usability problems in early versions of the API.