Saturday, September 23, 2017

Fast Multiclass Object Detection in Dlib 19.7

The new version of dlib is out and the biggest new feature is the ability to train multiclass object detectors with dlib's convolutional neural network tooling.  The previous version only allowed you to train single class detectors, but this release adds the option to create single CNN models that output multiple labels.  As an example, I created a small 894 image dataset where I annotated the fronts and rears of cars and used it to train a 2-class detector.  You can see the resulting detector running in this video:

If you want to run the car detector from this video on your own images you can check out this example program.

I've also improved the detector speed in dlib 19.7 by pushing more of the processing to the GPU. This makes the detector 2.5x faster.  For example, running the detector on the 928x478 image used in this example program ran at 39fps in the previous version of dlib, but now runs at 98fps (when run on a NVIDIA 1080ti).

This release also includes a new 5-point face landmarking model that finds the corners of the eyes and bottom of nose:

Unlike the 68-point landmarking model included with dlib, this model is over 10x smaller at 8.8MB compared to the 68-point model's 96MB.  It also runs faster, and even more importantly, works with the state-of-the-art CNN face detector in dlib as well as the older HOG face detector in dlib.  The central use-case of the 5-point model is to perform 2D face alignment for applications like face recognition.  In any of the dlib code that does face alignment, the new 5-point model is a drop-in replacement for the 68-point model and in fact is the new recommended model to use with dlib's face recognition tooling.

Sunday, August 27, 2017

Vehicle Detection with Dlib 19.5

Dlib v19.5 is out and there are a lot of new features. There is a dlib to caffe converter, a bunch of new deep learning layer types, cuDNN v6 and v7 support, and a bunch of optimizations that make things run faster in different situations, like ARM NEON support, which makes HOG based detectors run a lot faster on mobile devices.

However, the coolest and most requested feature has been an upgrade to the CNN+MMOD object detector to support detecting things with varying aspect ratios. The previous version of the detector required the training data to consist of objects that all had essentially the same aspect ratio. This is fine for tasks like face detection and dog hipsterization, but obviously not as general as you would like.

So dlib v19.5 includes an updated version of the MMOD loss layer that can be used to learn an object detector from a dataset with any mixture of bounding box shapes and sizes. To demo this new feature, I used the new MMOD code to create a vehicle detector, which you can see running on these videos. This detector is trained to find cars moving with you in traffic, and therefore cars where the rear end of the vehicle is visible.

The detector is just as fast as previous versions of the CNN+MMOD detector. For instance, when I run it on my NVIDIA 1080ti I can process 39 frames per second when processing them individually and 93 frames per second when processing them grouped into batches. This assumes a frame size of 928x478.

If you want to run this detector yourself you can check out the new example program that does just that. The detector was trained on a modest dataset of 2217 images, which is also available, as is the training code. Both these new example programs contain a lot of information about training this kind of detector and are worth reading if you want to understand the details involved. However, we can go into a short description here to understand how the detector works.

Take this image as an example. I ran the new vehicle detector on it and plotted the resulting detections as red boxes. So what are the processing steps that go from the raw image to the 6 boxes?  To roughly summarize, they are:
1. Create an image pyramid and pack the pyramid into one big image. Let's call this the "tiled pyramid"
2. Run the tiled pyramid image through a CNN. The CNN outputs a new image where bright pixels in the output image indicate the presence of cars.
3. Find pixels in the CNN's output image with a value > 0. Those locations are your preliminary car detections.
4. Perform non-maximum suppression on the preliminary detections to produce the final output.
Steps 3 and 4 are pretty straightforward. It's the first two steps that are complicated. So to understand them, let's visualize the outputs of these first two steps. All step 1 does is call dlib::create_tiled_pyramid on the input image to produce this new image:

What's special about this image is that we don't need to worry about scale anymore. That is, suppose we have a detection algorithm that can find cars, but it only knows how to find cars of a certain size. No problem. When you run it on this tiled pyramid image you are going to find each car somewhere in it at the scale your detector expects. Moreover, the tiled pyramid is only about 3.7 times larger than the original image, so processing it instead of the raw image gives us full scale invariance for only a 3.7x increase in computational cost. That's a very reasonable trade. Moreover, tiling it inside a rectangular image makes it very easy to process using normal CNN tooling on a GPU and still get full GPU speeds.

Now for step 2. The CNN takes the tiled pyramid as input, does a bunch of convolutions, and outputs a new set of images. In the case of our vehicle detector, it outputs 3 new images, each is a detection strength map that gets "hot" in locations likely to contain a vehicle. The reason there are 3 images for the vehicle detector is because there are, roughly, 3 different aspect ratios (tall and skinny e.g. semi trucks, short and wide e.g. sedans, and squarish e.g. SUVs). For purposes of display, I have combined the 3 images into one by taking the pointwise max of the 3 original images.  You can see this combined image below. The dark blue areas are places the CNN is saying "definitely not a vehicle" and the bright red locations are the positions it thinks contain a vehicle.

If we overlay this CNN output on top of the tiled pyramid you can see it's doing the right thing. The cars get bright red dots on them, right in the centers of the cars. Moreover, you can tell that the CNN is only detecting cars at a certain scale. The smaller cars are detected at the top of the pyramid and only as we progress down the pyramid does it begin to detect the larger cars.

After the CNN output is obtained, all the detection code needs to do is threshold the CNN output, find all the hot spots, apply non-max suppression, and output the boxes corresponding to the identified hot spots. And that's it, that's all the CNN+MMOD detector is doing.

On the other hand, describing how the CNN is trained is more complicated.  The code in dlib uses the usual stochastic gradient descent methods. You can see many of the details if you read the dlib DNN example programs.  How deep learning works in general is a big topic, but the thing most interesting here is the MMOD loss layer.  For the gory details on that I refer you to the MMOD paper which explains the loss function.  In the paper it is discussed in the context of networks that are linear in their parameters rather than non-linear in their parameters, as is our CNN here. However, for understanding the loss the difference between linear vs. non-linear is a minor detail. In fact, the loss equations are the same for both cases. The only difference is what kind of optimization algorithms are available for each case.  In the linear parameter case you can write a fancy numeric solver capable of solving the problem in a few minutes, but with a non-linear parameterization you have to resort to brute force SGD and GPUs running for many hours.

But at a very high level, it's running the entire detection process over and over during training, counting the number of detection mistakes (false alarms, missed detections, and duplicate detections), and back-propagating that error gradient through the CNN until the CNN stops messing up. Also, since the MMOD loss layer is counting mistakes after non-max suppression is applied, it knows that it needs to get the CNN to avoid producing high outputs in parts of the image that won't be suppressed by non-max suppression. This is why you see the dark blue areas of "definitely not a car" surrounding each of the car detections. The CNN has learned that it needs to be very careful on the border between "it's a car" and "it's not a car" to avoid accidentally detecting the same car multiple times.

This is perhaps easiest to see if we merge the pyramid layers back into the original image. If we make an image where the pixel value is the max over all scales in the pyramid we get this image:

Here you can clearly see the 6 car hotspots and the dark blue areas of "not a car" immediately surrounding them. Finally, overlaying this on the original image gives this wonderful image:

Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy to use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:

Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.

For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of datasets. The face scrub dataset[2], the VGG dataset[1], and then a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib.  There is also a Python API for accessing the face recognition model.

[1] O. M. Parkhi, A. Vedaldi, A. Zisserman Deep Face Recognition British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A data-driven approach to cleaning large face datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014

Tuesday, October 11, 2016

Easily Create High Quality Object Detectors with Deep Learning

A few years ago I added an implementation of the max-margin object-detection algorithm (MMOD) to dlib. This tool has since become quite popular as it frees the user from tedious tasks like hard negative mining.  You simply label things in images and it learns to detect them.  It also produces high quality detectors from relatively small amounts of training data.  For instance, one of dlib's example programs shows MMOD learning a serviceable face detector from only 4 images.

However, the MMOD implementation in dlib used HOG feature extraction followed by a single linear filter. This means it's incapable of learning to detect objects that exhibit complex pose variation or have a lot of other variability in how they appear.  To get around this, users typically train multiple detectors, one for each pose.  That works OK in many cases but isn't a really good general solution. Fortunately, over the last few years convolutional neural networks have proven themselves to be capable of dealing with all these issues within a single model.

So the obvious thing to do was to add an implementation of MMOD with the HOG feature extraction replaced with a convolutional neural network.  The new version of dlib, v19.2, contains just such a thing.  On this page you can see a short tutorial showing how to train a convolutional neural network using the MMOD loss function.  It uses dlib's new deep learning API to train the detector end-to-end on the very same 4 image dataset used in the HOG version of the example program.  Happily, and very much to the surprise of myself and my colleagues, it learns a working face detector from this tiny dataset. Here is the detector run over an image not in the training data:

I expected the CNN version of MMOD to inherit the low training data requirements of the HOG version of MMOD, but working with only 4 training images is very surprising considering other deep learning methods typically require many thousands of images to produce any kind of sensible results.

The detector is also reasonably fast for a CNN.  On the CPU, it takes about 370ms to process a 640x480 image.  On my NVIDIA Titan X GPU (the Maxwell version, not the newer Pascal version) it takes 45ms to process an image when images are processed one at a time.  If I group the images into batches then it takes about 18ms per image.

To really test the new CNN version of MMOD, I ran it through the leading face detection benchmark, FDDB.  This benchmark has two modes, 10-fold cross-validation and unrestricted.  Both test on the same dataset, but in the 10-fold cross-validation mode you are only allowed to train on data in the FDDB dataset.  In the unrestricted mode you can train on any data you like so long as it doesn't include images from FDDB.  I ran the 10-fold cross-validation version of the FDDB challenge.  This means I trained 10 CNN face detectors, each on 9 folds and tested on the held out 10th.  I did not perform any hyper parameter tuning.  Then I ran the results through the FDDB evaluation software and got this plot:

The X axis is the number of false alarms produced over the entire 2845 image dataset.  The Y axis is recall, i.e. the fraction of faces found by the detector. The green curve is the new dlib detector, which in this mode only gets about 4600 faces to train on. The red curve is the old Viola Jones detector which is still popular (although it shouldn't be, obviously). Most interestingly, the blue curve is a state-of-the-art result from the paper Face Detection with the Faster R-CNN, published only 4 months ago. In that paper, they train their detector on the very large WIDER dataset, which consists of 159,424 faces, and arguably get worse results on FDDB than the dlib detector trained on only 4600 faces.

As another test, I created the dog hipsterizer, which I made a post about a few days ago.  The hipsterizer used the exact same code and parameter settings to train a dog head detector.  The only difference was the training data consisted in 9240 dog heads instead of human faces.  That produced the very high quality models used in the hipsterizer.  So now we can automatically create fantastic images such as this one :)

 Barkhaus dogs looking fancy

As one last test of the new CNN MMOD tool I made a dataset of 6975 faces.  This dataset is a collection of face images selected from many publicly available datasets (excluding the FDDB dataset).  In particular, there are images from ImageNet, AFLW, Pascal VOC, the VGG dataset, WIDER, and face scrub.  Unlike FDDB, this new dataset contains faces in a wide range of poses rather than consisting of mostly front facing shots.  To give you an idea of what it looks like, here are all the faces in the dataset tightly cropped and tiled into one big image:

Using the new dlib tooling I trained a CNN on this dataset using the same exact code and parameter settings as used by the dog hipsterizer and previous FDDB experiment. If you want to run that CNN on your own images you can use this example program. I tested this CNN on FDDB's unrestricted protocol and found that it has a recall of 0.879134, which is quite good. However, it produced 90 false alarms.  Which sounds bad, until you look at them and find that it's finding labeling errors in FDDB.  The following image shows all the "false alarms" it outputs on FDDB.  All but one of them are actually faces.

Finally, to give you a more visceral idea of the difference in capability between the new CNN detector and the old HOG detector, here are a few images where I ran dlib's default HOG face detector (which is actually 5 HOG models) and the new CNN face detector. The red boxes are CNN detections and blue boxes are from the older HOG detector. While the HOG detector does an excellent job on easy faces looking at the camera, you can see that the CNN is way better at handling not just the easy cases but all faces in general.  And yes, I ran the HOG detector on all the images, it's just that it fails to find any faces in some of them.

Friday, October 7, 2016

Hipsterize Your Dog With Deep Learning

I'm getting ready to make the next dlib release, which should be out in a few days, and I thought I would point out a humorous new example program.  The dog hipsterizer!

It uses dlib's new deep learning tools to detect dogs looking at the camera. Then it uses the dlib shape predictor to identify the positions of the eyes, nose, and top of the head. From there it's trivial to make your dog hip with glasses and a mustache :)

This is what you get when you run the dog hipsterizer on this awesome image:
 Barkhaus dogs looking fancy

Saturday, August 13, 2016

Dlib 19.1 Released

cuDNN 5.1 is out and it isn't completely backwards compatible with cuDNN 5.0 due to a bug in cuDNN 5.1.  For the curious, in cuDNN 5.1 cudnnGetConvolutionBackwardFilterAlgorithm() will select the winograd algorithm even when the conv descriptor has a stride not equal to 1, which is an error according to the cuDNN documentation.  If you then try to run the winograd algorithm, which is what cudnnGetConvolutionBackwardFilterAlgorithm() says to do, it leads to the wrong outputs and things don't work.  Fortunately, this was detected by dlib's unit tests :)

Therefore, dlib has been updated to work with cuDNN 5.1 and hence we have a dlib 19.1 release, which you can download from dlib's home page.

Sunday, June 26, 2016

A Clean C++11 Deep Learning API

Dlib 19.0 is out and it has a lot of new features, like new elastic net and quadratic program solvers. But the feature I'm most excited about is the new deep learning API. There are a lot of existing deep learning frameworks, but none of them have clean C++ APIs. You have to use them through a language like Python or Lua, which is fine in and of itself. But if you are a professional software engineer working on embedded computer vision projects you are probably working in C++, and using those tools in these kinds of applications can be frustrating.

So if you use C++ to do computer vision work then dlib's deep learning framework is for you. It makes heavy use of C++11 features, allowing it to expose a very clean and lightweight API. For example, the venerable LeNet can be defined in pure C++ with a using statement:

 LeNet

    using LeNet = loss_multiclass_log<
fc<10,
relu<fc<84,
relu<fc<120,
max_pool<2,2,2,2,relu<con<16,5,5,1,1,
max_pool<2,2,2,2,relu<con<6,5,5,1,1,
input<matrix<unsigned char>>>>>>>>>>>>>>;

Then, using it to train and test a neural network looks like this:

    LeNet net;
dnn_trainer<LeNet> trainer(net);
trainer.set_learning_rate(0.01);
trainer.set_min_learning_rate(0.00001);
trainer.set_mini_batch_size(128);
trainer.train(training_images, training_labels);
// Ask the net to predict labels for all the testing images
auto predicted_labels = net(testing_images);

Dlib will even automatically switch to lower learning rates when the training error stops improving, so you won't have to fiddle with learning rate schedules. The API will certainly let you do so if you want that control. But I've been able to train a number of state-of-the-art ImageNet models without any manual fiddling of learning rates, which I find to be very convenient.

Depending on how you compile dlib, it will use either the CPU or cuDNN v5. It also supports using multiple GPUs during training and has a "fast mode" and a "low VRAM" mode. Compared to Caffe, dlib's fast mode is about 1.6x times faster than Caffe but uses about 1.5x as much VRAM, while the low VRAM mode is about 0.85x the speed of Caffe but uses half the VRAM as Caffe. So dlib's new deep learning API is fast but can also let you run larger models in the same amount of VRAM if you are VRAM constrained.

It's also fully documented. The basics are covered in this tutorial and then more advanced concepts are covered in a follow on tutorial. These tutorials show how to define LeNet and ResNet architectures in dlib and another tutorial shows how to define Inception networks. And even more importantly, every function and class in the API is documented in the reference material. Moreover, if you want to define your own computational layersloss layers, input layers, or solvers, you can because the interfaces you have to implement are fully documented.

I've also included a pretrained ResNet34A model and this example shows how to use it to classify images. This pretrained model has a top5 error of 7.572% on the 2012 imagenet validation dataset, which is slightly better than the results reported in the original paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun. Training this model took about two weeks while running on a single Titan X GPU.

To use the new deep learning tools, all you need to install is cuDNN v5.  Then you can compile the dlib example programs using the normal CMake commands.  There are no other dependencies. In fact, if you don't install cuDNN CMake will automatically configure dlib to use only the CPU and the examples will still run (but much slower).  You will however need a C++11 compiler, which precludes current versions of visual studio since they shamefully still lack full C++11 support.  But any mildly recent version of GCC will work.  Also, you can use visual studio with the non-DNN parts of dlib as they don't require C++11 support.

Finally, development of this new deep learning toolkit was sponsored by Systems & Technology Research, as part of the IARPA JANUS project. Without their support and feedback it wouldn't be nearly as polished and flexible. Jeffrey Byrne in particular was instrumental in finding bugs and usability problems in early versions of the API.