Tuesday, October 11, 2016

Easily Create High Quality Object Detectors with Deep Learning

A few years ago I added an implementation of the max-margin object-detection algorithm (MMOD) to dlib. This tool has since become quite popular as it frees the user from tedious tasks like hard negative mining.  You simply label things in images and it learns to detect them.  It also produces high quality detectors from relatively small amounts of training data.  For instance, one of dlib's example programs shows MMOD learning a serviceable face detector from only 4 images.

However, the MMOD implementation in dlib used HOG feature extraction followed by a single linear filter. This means it's incapable of learning to detect objects that exhibit complex pose variation or have a lot of other variability in how they appear. To get around this, users typically train multiple detectors, one for each pose. That works OK in many cases but isn't really a good general solution. Fortunately, over the last few years convolutional neural networks have proven themselves to be capable of dealing with all these issues within a single model.

So the obvious thing to do was to add an implementation of MMOD with the HOG feature extraction replaced with a convolutional neural network.  The new version of dlib, v19.2, contains just such a thing.  On this page you can see a short tutorial showing how to train a convolutional neural network using the MMOD loss function.  It uses dlib's new deep learning API to train the detector end-to-end on the very same 4 image dataset used in the HOG version of the example program.  Happily, and very much to the surprise of myself and my colleagues, it learns a working face detector from this tiny dataset. Here is the detector run over an image not in the training data:


I expected the CNN version of MMOD to inherit the low training data requirements of the HOG version of MMOD, but working with only 4 training images is very surprising considering other deep learning methods typically require many thousands of images to produce any kind of sensible results.

The detector is also reasonably fast for a CNN.  On the CPU, it takes about 370ms to process a 640x480 image.  On my NVIDIA Titan X GPU (the Maxwell version, not the newer Pascal version) it takes 45ms to process an image when images are processed one at a time.  If I group the images into batches then it takes about 18ms per image.
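To put those latencies in terms of throughput, here is a quick conversion from milliseconds per image to images per second. The function name is only illustrative; the timings are the ones quoted above.

```python
def images_per_second(ms_per_image: float) -> float:
    """Convert a per-image latency in milliseconds to throughput."""
    if ms_per_image <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_image


# Timings quoted in this post.
timings_ms = {
    "CPU": 370.0,
    "Titan X, one image at a time": 45.0,
    "Titan X, batched": 18.0,
}

for name, ms in timings_ms.items():
    print(f"{name}: {images_per_second(ms):.1f} images/sec")
```

That works out to roughly 2.7, 22, and 56 images per second respectively.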

To really test the new CNN version of MMOD, I ran it through the leading face detection benchmark, FDDB. This benchmark has two modes, 10-fold cross-validation and unrestricted. Both test on the same dataset, but in the 10-fold cross-validation mode you are only allowed to train on data in the FDDB dataset. In the unrestricted mode you can train on any data you like so long as it doesn't include images from FDDB. I ran the 10-fold cross-validation version of the FDDB challenge. This means I trained 10 CNN face detectors, each on 9 folds and tested on the held-out 10th. I did not perform any hyperparameter tuning. Then I ran the results through the FDDB evaluation software and got this plot:


The X axis is the number of false alarms produced over the entire 2845-image dataset. The Y axis is recall, i.e. the fraction of faces found by the detector. The green curve is the new dlib detector, which in this mode only gets about 4600 faces to train on. The red curve is the old Viola-Jones detector, which is still popular (although it shouldn't be, obviously). Most interestingly, the blue curve is a state-of-the-art result from the paper Face Detection with the Faster R-CNN, published only 4 months ago. In that paper, they train their detector on the very large WIDER dataset, which consists of 159,424 faces, and arguably get worse results on FDDB than the dlib detector trained on only 4600 faces.
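For readers unfamiliar with the protocol, the 10-fold setup behind this plot can be sketched roughly as follows. This is a minimal illustration, not the code used for the experiment: the round-robin fold assignment and the function name are assumptions, and FDDB ships its own predefined fold lists that a real evaluation would use instead.

```python
from typing import List, Tuple


def k_fold_splits(num_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= num_items:
        raise ValueError("need 2 <= k <= num_items")
    # Assign item i to fold i % k (an illustrative choice, see the note above).
    folds = [[i for i in range(num_items) if i % k == f] for f in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        splits.append((train, test))
    return splits


splits = k_fold_splits(2845, k=10)  # 2845 images, as in FDDB
for fold_number, (train, test) in enumerate(splits, start=1):
    # A real run would train a detector on `train` and score it on `test`.
    print(f"fold {fold_number}: train on {len(train)} images, test on {len(test)}")
```

Every image ends up in exactly one test fold, so each detector is always scored on images it never saw during training.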

As another test, I created the dog hipsterizer, which I made a post about a few days ago. The hipsterizer used the exact same code and parameter settings to train a dog head detector. The only difference was that the training data consisted of 9240 dog heads instead of human faces. That produced the very high quality models used in the hipsterizer. So now we can automatically create fantastic images such as this one :)

Barkhaus dogs looking fancy

As one last test of the new CNN MMOD tool I made a dataset of 6975 faces.  This dataset is a collection of face images selected from many publicly available datasets (excluding the FDDB dataset).  In particular, there are images from ImageNet, AFLW, Pascal VOC, the VGG dataset, WIDER, and face scrub.  Unlike FDDB, this new dataset contains faces in a wide range of poses rather than consisting of mostly front facing shots.  To give you an idea of what it looks like, here are all the faces in the dataset tightly cropped and tiled into one big image:


Using the new dlib tooling I trained a CNN on this dataset using the same exact code and parameter settings as used by the dog hipsterizer and previous FDDB experiment. If you want to run that CNN on your own images you can use this example program. I tested this CNN on FDDB's unrestricted protocol and found that it has a recall of 0.879134, which is quite good. However, it produced 90 false alarms, which sounds bad until you look at them and find that it's finding labeling errors in FDDB. The following image shows all the "false alarms" it outputs on FDDB. All but one of them are actually faces.
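As a quick illustration of what that recall number means, here is the arithmetic under two assumptions that do not come from this post: FDDB's published count of 5,171 annotated faces, and a hypothetical count of 4,546 matched detections chosen because it reproduces the reported figure.

```python
def recall(true_positives: int, total_ground_truth: int) -> float:
    """Fraction of annotated objects that the detector found."""
    if total_ground_truth <= 0:
        raise ValueError("need at least one ground-truth object")
    return true_positives / total_ground_truth


# Assumptions, not taken from this post: FDDB's 5,171 annotated faces, and a
# hypothetical count of 4,546 matched detections used to illustrate the math.
fddb_faces = 5171
matched = 4546
print(f"recall = {recall(matched, fddb_faces):.6f}")
```

Under those assumptions the result rounds to 0.879134. Note that detections of real faces missing from the annotations count as false alarms and do nothing for recall, which is why labeling errors make a detector look worse than it is.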


Finally, to give you a more visceral idea of the difference in capability between the new CNN detector and the old HOG detector, here are a few images where I ran dlib's default HOG face detector (which is actually 5 HOG models) and the new CNN face detector. The red boxes are CNN detections and blue boxes are from the older HOG detector. While the HOG detector does an excellent job on easy faces looking at the camera, you can see that the CNN is way better at handling not just the easy cases but all faces in general. And yes, I ran the HOG detector on all the images; it's just that it fails to find any faces in some of them.





217 comments :

«Oldest   ‹Older   201 – 217 of 217
Juhi Kumari said...

Hi Davis,

Thanks for the reply.

I am trying to train the CNN using the sample program "dnn_mmod_ex1.cpp" (without changing any parameters). But after training finishes, it's not detecting any faces. I used a training dataset of 300 images.

What should I do to train the CNN properly? Do I need to increase the number of training images, or should iterations_without_progress_threshold be increased to 8000?

Davis King said...

If you run the dnn_mmod_ex.cpp example, without making any modifications at all and using the provided dataset, you will see it working and finding faces. I don't know why your dataset isn't working. Maybe your dataset is bad. You should do some experiments and look at example datasets to figure out why it doesn't work for you.

Juha Reunanen said...

@Juhi Kumari: I had a similar problem, so I ended up creating this PR.

If you have time, you could check out this branch, increase mmod_options::loss_per_missed_target (e.g., from 1 to 10), and re-run the training to see if it makes any difference at all.

Azzy said...

@Davis

In your dnn_mmod_ex.cpp you mentioned that the receptive field of the CNN was a little over 50x50.

1. How did you calculate this? Can you point to any resource?
2. Also does this size somehow relate to the minimum detection size?
3. What is the relevance of the window size in mmod_options, as in mmod_options options(face_boxes_train, 40,40);?


I tried running the example as is but with my own training data (human heads in top view) but the average loss doesn't go below 1.0 and I get testing result 1 0 0; training result 1 0 0;

Is there anything I need to know about properly training this detector?

Davis King said...

Any resource that explains what a deep neural network is should make the idea of receptive field clear. I don't have any particular pointers. But any introductory material should make it obvious.

If the network doesn't look at the whole object, because the receptive field is too small, yes, that's obviously a huge problem.

These parameters are all explained in the documentation at some length. I don't know of a clearer way to explain it. I think you need to take a step back and read some introductory material on DNN and image processing fundamentals before you go further here.
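For readers with the same receptive field question, the standard calculation for a stack of convolution (or pooling) layers can be sketched as below. The layer list is a hypothetical stack for illustration only, not the network from dnn_mmod_ex.cpp, and the function name is made up for this sketch.

```python
from typing import List, Tuple


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field (in input pixels, along one axis) of stacked layers.

    Each layer is (kernel_size, stride). Padding shifts where the field sits
    but does not change its size, so it is ignored here.
    """
    field = 1   # a single input pixel sees itself
    jump = 1    # distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Hypothetical stack for illustration only, not the network in dnn_mmod_ex.cpp:
# two 5x5 stride-2 convolutions followed by two 3x3 stride-1 convolutions.
toy_layers = [(5, 2), (5, 2), (3, 1), (3, 1)]
print(receptive_field(toy_layers))  # prints 29
```

Each output location of this toy stack sees a 29x29 patch of the input. An object that is much larger than the receptive field at a given scale cannot be seen whole by any single output location.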

阿法測試 said...

Hi Davis

I was looking at the MMOD paper and I'm a bit confused by Equation (8).

The constraints in your paper, "F(x_i, y_i) >= max [F(x_i, y) + delta(y, y_i)] - eps_i", are slightly different from what I saw in [10] from the reference list.

In [10], the constraints are defined as "F(x_i, y_i) >= F(x_i, y) + delta(y, y_i) - eps_i for all i, y ~= y_i".

Could you please tell me whether these two constraints are identical?

Sorry for my poor English.

Thanks

Davis King said...

They are the same thing, it's just written slightly differently. Think about it. What is the max over?

阿法測試 said...

Hi Davis

First, thanks for your reply.

And I think I get it: the max is over all y in Y. With the max operation, you don't have to worry about the non-max y, because if even the max one satisfies the constraint, all the others do too.
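Spelled out, the equivalence between the two forms quoted earlier in this thread looks roughly like this. It assumes the usual conventions that Delta(y_i, y_i) = 0 and that the slack satisfies eps_i >= 0; neither is stated in the thread. Here \Delta and \epsilon_i stand for the "delta" and "eps_i" written above.

```latex
% One constraint per training example i, using a max over all labelings y:
F(x_i, y_i) \;\ge\; \max_{y \in Y}\bigl[F(x_i, y) + \Delta(y, y_i)\bigr] - \epsilon_i

% A bound on the max holds exactly when it holds for every term inside the max:
\iff\quad F(x_i, y_i) \;\ge\; F(x_i, y) + \Delta(y, y_i) - \epsilon_i
\quad \text{for all } y \in Y

% The y = y_i term reduces to \epsilon_i \ge 0, so given that assumption only
% the y \ne y_i terms remain, which is the form quoted from [10]:
\iff\quad F(x_i, y_i) \;\ge\; F(x_i, y) + \Delta(y, y_i) - \epsilon_i
\quad \text{for all } y \ne y_i
```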



Also, I have a few questions about how MMOD+CNN works.

As far as I can tell, the features of a given image region were originally extracted by HOG, which is a fixed CV algorithm. In MMOD + CNN, you replace HOG with a CNN, which means the CNN now plays the role of feature extractor and somehow learns to generate good features for each image region. Here are my questions:

1. The CNN needs to be trained to be able to extract features, and I know that you use SGD for this. But what is the objective function of this problem? Is it the same as the original one (eq. 8) in the MMOD paper? If so, how do you deal with all the constraints listed below it?

2. I think you treat the entire MMOD + CNN as one big network and train it all together. But if I think of this model as a CNN feature extractor plus an MMOD structured SVM, can I view the entire optimization process as optimizing an MMOD SVM with SGD and using the gradient of the loss to update the CNN weights by backprop?

Davis King said...

Yep, you got it.

It's the same objective function as in the paper. See equation (13) which is unconstrained. It's just one big objective function and you can use SGD on it without any special tricks.
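The idea of folding the constraints into the objective can be illustrated on a toy problem. The sketch below computes a structured hinge loss, max over y of [F(x, y) + delta(y, y_i)] minus F(x, y_i), for a label set small enough to enumerate. It is only an illustration of the general technique: it is not dlib's implementation, it does not reproduce equation (13) itself, and the scores, labels, and function names are all hypothetical.

```python
from typing import Callable, Dict, Hashable


def structured_hinge_loss(
    scores: Dict[Hashable, float],
    truth: Hashable,
    delta: Callable[[Hashable, Hashable], float],
) -> float:
    """max_y [F(x, y) + delta(y, truth)] - F(x, truth) over a finite label set."""
    augmented = max(score + delta(y, truth) for y, score in scores.items())
    return augmented - scores[truth]


def zero_one(y: Hashable, truth: Hashable) -> float:
    """Toy task loss: 0 for the correct label, 1 otherwise."""
    return 0.0 if y == truth else 1.0


# Hypothetical scores F(x, y) for three candidate labelings of one image.
scores = {"correct": 2.0, "near_miss": 1.5, "wrong": -1.0}
print(structured_hinge_loss(scores, "correct", zero_one))  # prints 0.5
```

This quantity is the smallest slack that satisfies the max-form constraint for one example, so summing it over the training set gives an objective with no explicit constraints. It is zero once the correct labeling beats every other labeling by at least the task loss, and its gradient with respect to the scores can be passed back through whatever produced them.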

阿法測試 said...

Hi davis

I ran the CNN face detector Python example from GitHub, but it took longer to process a 600x400 picture than the times you mention in this post.

I tested this on 2 PCs with different configurations. One runs on the CPU (i5-6400, 2.7GHz), using Docker on Windows. The other runs on a GTX 1080 Ti on Ubuntu. The former took about 2.7 sec to finish the detection and the latter took 0.67 sec. Both use the Python API.

So I have several questions:

1. What is your PC configuration (CPU, GPU...)?

2. Does the Python API or Docker slow down the detection process?

3. Is there anything else that could affect the processing speed?

Sorry for my poor English

Davis King said...

You are probably upsampling the image. That example upsamples the image once, making it bigger. Bigger images take longer to process. So don't upsample it.

If you run on the CPU you should link to the Intel MKL so it's not slow.

No, there is nothing slow about the python API. It's just calling the same C++ code.
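To see why upsampling matters so much, here is a rough sketch of the pixel-count arithmetic together with a call that skips upsampling. It assumes each upsampling step doubles both width and height, which is an assumption of this sketch. The dlib calls are the ones used in dlib's cnn_face_detector.py example; the helper names and the file paths passed in are hypothetical.

```python
def pixels_after_upsampling(width: int, height: int, times: int) -> int:
    """Pixel count after `times` upsampling steps, assuming each step
    doubles both width and height."""
    if times < 0:
        raise ValueError("times must be non-negative")
    scale = 2 ** times
    return (width * scale) * (height * scale)


def detect_faces(image_path: str, model_path: str, upsample: int = 0):
    """Run dlib's CNN face detector; requires dlib and a trained model file."""
    import dlib  # imported here so the arithmetic above works without dlib

    detector = dlib.cnn_face_detection_model_v1(model_path)
    image = dlib.load_rgb_image(image_path)
    # The second argument is the number of times to upsample before detecting.
    return detector(image, upsample)


base = pixels_after_upsampling(600, 400, 0)
once = pixels_after_upsampling(600, 400, 1)
print(f"{once / base:.0f}x as many pixels after one upsample")
```

Under that assumption a single upsample quadruples the number of pixels the network has to process, so passing 0 instead of 1 is the cheapest speedup available, at the cost of missing smaller faces.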

阿法測試 said...

Hi Davis

First, thanks for the advice.

I've disabled the upsampling and also linked dlib to MKL, and it does run much faster than before.


Now that it's working well, I have some questions about the training data.

I know that pre-trained weights for the CNN face detector are provided. But what kind of training data did you use to train the model (color space, image size, pre-processing, and so on)?

I just want to make sure that I'm using this model properly.

Thanks

Davis King said...

Use normal color images. There are example images in the examples/faces folder.

qiufengfly said...

Hi Davis,
Sorry to trouble you. I have a problem: I want to use dnn_metric_learn_on_images_ex.cpp to train metric_network_renset.dat. Building in release mode with VS2015 gives an error: dlib-19.7\dlib\dnn\core.h(2642): fatal error C1002:
I hope you can help. Thank you!

Davis King said...

Get the latest dlib from GitHub and follow the instructions written in the examples/CMakeLists.txt file for compiling that example. They contain instructions for working around this bug in Visual Studio.
