Tuesday, October 11, 2016

Easily Create High Quality Object Detectors with Deep Learning

A few years ago I added an implementation of the max-margin object-detection algorithm (MMOD) to dlib. This tool has since become quite popular as it frees the user from tedious tasks like hard negative mining.  You simply label things in images and it learns to detect them.  It also produces high quality detectors from relatively small amounts of training data.  For instance, one of dlib's example programs shows MMOD learning a serviceable face detector from only 4 images.

However, the MMOD implementation in dlib used HOG feature extraction followed by a single linear filter. This means it's incapable of learning to detect objects that exhibit complex pose variation or have a lot of other variability in how they appear.  To get around this, users typically train multiple detectors, one for each pose.  That works OK in many cases but isn't a really good general solution. Fortunately, over the last few years convolutional neural networks have proven themselves to be capable of dealing with all these issues within a single model.

So the obvious thing to do was to add an implementation of MMOD with the HOG feature extraction replaced with a convolutional neural network.  The new version of dlib, v19.2, contains just such a thing.  On this page you can see a short tutorial showing how to train a convolutional neural network using the MMOD loss function.  It uses dlib's new deep learning API to train the detector end-to-end on the very same 4 image dataset used in the HOG version of the example program.  Happily, and very much to the surprise of myself and my colleagues, it learns a working face detector from this tiny dataset. Here is the detector run over an image not in the training data:


I expected the CNN version of MMOD to inherit the low training data requirements of the HOG version of MMOD, but working with only 4 training images is very surprising considering other deep learning methods typically require many thousands of images to produce any kind of sensible results.
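In code, the training setup from that tutorial looks roughly like this. This is a sketch modeled on dlib's dnn_mmod_ex.cpp; the filter counts, layer sizes, and the direct trainer.train() call are simplified from memory, so see the example program for the real thing:

#include <dlib/dnn.h>
#include <dlib/data_io.h>
using namespace dlib;

// A small downsampling front end followed by a few 5x5 conv layers.  The
// exact filter counts below are illustrative.
template <long N, typename SUBNET> using con5d = con<N,5,5,2,2,SUBNET>;
template <long N, typename SUBNET> using con5  = con<N,5,5,1,1,SUBNET>;
template <typename SUBNET> using downsampler = relu<bn_con<con5d<32, relu<bn_con<con5d<32, relu<bn_con<con5d<16,SUBNET>>>>>>>>>;
template <typename SUBNET> using rcon5 = relu<bn_con<con5<45,SUBNET>>>;
using net_type = loss_mmod<con<1,6,6,1,1,rcon5<rcon5<rcon5<downsampler<input_rgb_image_pyramid<pyramid_down<6>>>>>>>>;

int main()
{
    std::vector<matrix<rgb_pixel>> images_train;
    std::vector<std::vector<mmod_rect>> boxes_train;
    load_image_dataset(images_train, boxes_train, "faces/training.xml");

    // 40x40 pixel sliding window, as in the example program.
    mmod_options options(boxes_train, 40, 40);
    net_type net(options);

    dnn_trainer<net_type> trainer(net);
    trainer.set_learning_rate(0.1);
    trainer.be_verbose();
    // The real example feeds the trainer random crops from a random_cropper
    // inside a loop; calling train() directly is a simplification.
    trainer.train(images_train, boxes_train);

    net.clean();
    serialize("mmod_network.dat") << net;
}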

The detector is also reasonably fast for a CNN.  On the CPU, it takes about 370ms to process a 640x480 image.  On my NVIDIA Titan X GPU (the Maxwell version, not the newer Pascal version) it takes 45ms to process an image when images are processed one at a time.  If I group the images into batches then it takes about 18ms per image.
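Batching here just means handing the network a std::vector of images instead of one image at a time. A minimal sketch of that call, assuming net is the trained MMOD network and the images are already loaded:

// Minimal sketch of batched inference.  Assumes `net` is a trained loss_mmod
// network (e.g. loaded from mmod_network.dat).
std::vector<matrix<rgb_pixel>> imgs;
// ... load the images you want to process into imgs ...

// Process everything in mini-batches of 32; the call returns one
// std::vector<mmod_rect> of detections per input image.
std::vector<std::vector<mmod_rect>> all_dets = net(imgs, 32);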

To really test the new CNN version of MMOD, I ran it through the leading face detection benchmark, FDDB.  This benchmark has two modes, 10-fold cross-validation and unrestricted.  Both test on the same dataset, but in the 10-fold cross-validation mode you are only allowed to train on data in the FDDB dataset.  In the unrestricted mode you can train on any data you like so long as it doesn't include images from FDDB.  I ran the 10-fold cross-validation version of the FDDB challenge.  This means I trained 10 CNN face detectors, each on 9 folds, and tested each on the held-out 10th.  I did not perform any hyperparameter tuning.  Then I ran the results through the FDDB evaluation software and got this plot:


The X axis is the number of false alarms produced over the entire 2845 image dataset.  The Y axis is recall, i.e. the fraction of faces found by the detector. The green curve is the new dlib detector, which in this mode only gets about 4600 faces to train on. The red curve is the old Viola-Jones detector, which is still popular (although it shouldn't be, obviously). Most interestingly, the blue curve is a state-of-the-art result from the paper Face Detection with the Faster R-CNN, published only 4 months ago. In that paper, they train their detector on the very large WIDER dataset, which consists of 159,424 faces, and arguably get worse results on FDDB than the dlib detector trained on only 4600 faces.

As another test, I created the dog hipsterizer, which I made a post about a few days ago.  The hipsterizer used the exact same code and parameter settings to train a dog head detector.  The only difference was that the training data consisted of 9240 dog heads instead of human faces.  That produced the very high quality models used in the hipsterizer.  So now we can automatically create fantastic images such as this one :)

Barkhaus dogs looking fancy

As one last test of the new CNN MMOD tool I made a dataset of 6975 faces.  This dataset is a collection of face images selected from many publicly available datasets (excluding the FDDB dataset).  In particular, there are images from ImageNet, AFLW, Pascal VOC, the VGG dataset, WIDER, and face scrub.  Unlike FDDB, this new dataset contains faces in a wide range of poses rather than consisting of mostly front facing shots.  To give you an idea of what it looks like, here are all the faces in the dataset tightly cropped and tiled into one big image:


Using the new dlib tooling I trained a CNN on this dataset using the same exact code and parameter settings as used by the dog hipsterizer and previous FDDB experiment. If you want to run that CNN on your own images you can use this example program. I tested this CNN on FDDB's unrestricted protocol and found that it has a recall of 0.879134, which is quite good. However, it produced 90 false alarms, which sounds bad until you look at them and find that it's mostly finding labeling errors in FDDB.  The following image shows all the "false alarms" it outputs on FDDB.  All but one of them are actually faces.


Finally, to give you a more visceral idea of the difference in capability between the new CNN detector and the old HOG detector, here are a few images where I ran dlib's default HOG face detector (which is actually 5 HOG models) and the new CNN face detector. The red boxes are CNN detections and the blue boxes are from the older HOG detector. While the HOG detector does an excellent job on easy faces looking at the camera, you can see that the CNN is way better at handling not just the easy cases but faces in general.  And yes, I ran the HOG detector on all the images; it just fails to find any faces in some of them.





254 comments :

Davis King said...

If you run the dnn_mmod_ex.cpp example, without making any modifications at all and using the provided dataset, you will see it working and finding faces. I don't know why your dataset isn't working. Maybe your dataset is bad. You should do some experiments and look at example datasets to figure out why it doesn't work for you.

Unknown said...

@Juhi Kumari: I had a similar problem, so I ended up creating this PR.

If you have time, you could check out this branch, increase mmod_options::loss_per_missed_target (e.g., from 1 to 10), and re-run the training to see if it makes any difference at all.
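For context, with that change the knob is just a field on mmod_options, set before constructing the network. A rough sketch (the value 10 is only an example, and the variable names are taken from the example program):

// Sketch: penalize missed detections more heavily than false alarms.
// loss_per_missed_target and loss_per_false_alarm are mmod_options fields;
// the values below are only illustrative.
mmod_options options(face_boxes_train, 40, 40);
options.loss_per_missed_target = 10;  // default is 1
options.loss_per_false_alarm   = 1;
net_type net(options);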

Azzy said...

@Davis

In your dnn_mmod_ex.cpp you mentioned that the receptive field of the CNN was a little over 50x50.

1. How did you calculate this? Can you point to any resource?
2. Also does this size somehow relate to the minimum detection size?
3. What is the relevance of the window size in mmod_options, as in mmod_options options(face_boxes_train, 40,40)?


I tried running the example as-is but with my own training data (human heads viewed from above), but the average loss doesn't go below 1.0 and I get a testing result of 1 0 0 and a training result of 1 0 0.

Is there anything I need to know about properly training this detector?

Davis King said...

Any resource that explains what a deep neural network is should make the idea of receptive field clear. I don't have any particular pointers. But any introductory material should make it obvious.

If the network doesn't look at the whole object, because the receptive field is too small, yes, that's obviously a huge problem.

These parameters are all explained in the documentation at some length. I don't know of a clearer way to explain it. I think you need to take a step back and read some introductory material on DNN and image processing fundamentals before you go further here.

Unknown said...

Hi Davis

I was looking at the MMOD paper and I'm a bit confused by equation (8).

The constraints "F(x_i, y_i) >= max_y [ F(x_i, y) + delta(y, y_i) ] - eps_i" in your paper are slightly different from what I saw in [10], which is in the reference list. In [10] the constraints are defined as

"F(x_i, y_i) >= F(x_i, y) + delta(y, y_i) - eps_i   for all y != y_i".

Would you please tell me whether these two sets of constraints are equivalent or not?

Sorry for my poor English.

thanks

Davis King said...

They are the same thing, it's just written slightly differently. Think about it: what is the max over?

Unknown said...

Hi Davis

First, thanks for your reply. I think I get it: the max is over all y in Y. With the max operation you don't have to worry about the non-max y, because if even the maximizing y satisfies the constraint then all the others do too.

I also have a few questions about how MMOD+CNN works. As far as I can tell, the features of a given image region were originally extracted by HOG, which is a fixed computer vision algorithm. In MMOD+CNN you replace HOG with a CNN, which means the CNN now plays the role of feature extractor and somehow learns to generate good features for each image region. My questions are:

1. The CNN needs to be trained to be able to extract features, and I know you use SGD to solve this. But what is the objective function of this problem? Is it the same one (eq. 8) as in the original MMOD paper? If so, how do you deal with all the constraints?

2. I think you treat the entire MMOD+CNN as one big network and train it all together. But if I think of this model as a CNN feature extractor plus an MMOD structured SVM, can I view the whole optimization as training an MMOD SVM with SGD while using the gradient of the loss to update the weights of the CNN via backprop?

Davis King said...

Yep, you got it.

It's the same objective function as in the paper. See equation (13) which is unconstrained. It's just one big objective function and you can use SGD on it without any special tricks.
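Roughly, the unconstrained form being referred to is the usual margin-rescaled structured hinge loss (written here from memory, up to the exact regularization constant used in the paper):

\[ \min_{w}\; \frac{\lambda}{2}\lVert w\rVert^{2} \;+\; \frac{1}{n}\sum_{i=1}^{n}\Big(\max_{y\in\mathcal{Y}}\big[F(x_i,y)+\Delta(y,y_i)\big]-F(x_i,y_i)\Big) \]

The max inside the sum is exactly the max the constraints in equation (8) were expressing.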

Unknown said...

Hi Davis,

I've run the CNN face detector Python example from GitHub, but it took more time to process a 600x400 picture than what you mentioned in this post. I've tested this on 2 PCs with different configurations: one runs on the CPU (i5-6400, 2.7GHz), where I use Docker to run it on Windows, and the other runs on a GTX 1080 Ti on Ubuntu. The former took about 2.7 seconds to finish the detection and the latter took 0.67 seconds; both are using the Python API.

So here I am, with several questions:

1. What is your PC setup (CPU, GPU, ...)?

2. Does the Python API or Docker slow down the detection process?

3. Is there anything else that could affect the processing speed?

Sorry for my poor English.

Davis King said...

You are probably upsampling the image. That example upsamples the image once, making it bigger. Bigger images take longer to process. So don't upsample it.

If you run on the CPU you should link to the Intel MKL so it's not slow.

No, there is nothing slow about the Python API. It's just calling the same C++ code.
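For anyone hitting the same thing, here is a rough C++ sketch of the difference. The Python example's upsampling corresponds to calling pyramid_up on the image before handing it to the network; net_type is assumed to be the network defined in dnn_mmod_face_detection_ex.cpp, and mmod_human_face_detector.dat is the pretrained model that ships with dlib:

#include <iostream>
#include <dlib/dnn.h>
#include <dlib/image_io.h>
#include <dlib/image_transforms.h>
using namespace dlib;

// net_type: use the definition from dnn_mmod_face_detection_ex.cpp (not repeated here).

int main()
{
    net_type net;
    deserialize("mmod_human_face_detector.dat") >> net;

    matrix<rgb_pixel> img;
    load_image(img, "my_image.jpg");

    // The example upsamples the image so small faces become detectable, which
    // also makes it slower.  Leave this out for speed if your faces are
    // already bigger than the detector's roughly 50x50 window.
    // pyramid_up(img);

    std::vector<mmod_rect> dets = net(img);
    for (auto&& d : dets)
        std::cout << "face at " << d.rect << std::endl;
}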

Unknown said...

Hi Davis,

First, thanks for the advice. I've disabled the upsampling and also linked dlib to the MKL, and it does run much faster than before.

Now that it's working well, I have some questions about the training data. I know a pretrained CNN face detector model is provided, but what kind of training data did you use to train it (color space, image size, pre-processing, and so on)? I just want to make sure I'm using the model properly.

Thanks

Davis King said...

Use normal color images. There are example images in the examples/faces folder.

qiufengfly said...

Hi Davis,
Sorry to trouble you, but I have a problem. I want to use dnn_metric_learning_on_images_ex.cpp to train metric_network_renset.dat, but building in release mode with VS2015 gives this error: dlib-19.7\dlib\dnn\core.h(2642): fatal error C1002.
Hope to get your help. Thank you!

Davis King said...

Get the latest dlib from github and follow the instructions written in the examples/CMakeLists.txt file for compiling that example. They contain instructions for working around this bug in visual studio.

Unknown said...

Hi, great work here, but I have a problem executing the mmod_face_detection example. I don't have this kind of problem with the other examples, just this one.

When I execute it in Qt, it gives the error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

When I execute it from the built examples on the command line, the error is:
std::bad_array_new_length

I'd appreciate it if you could help.

Davis King said...

Are you running the example on the provided images?

Unknown said...

Yes, I am doing exactly what the comments say. It's not complicated; it should work like the other examples.

Davis King said...

Then your computer just doesn't have enough RAM. Try it on a smaller image.

Unknown said...

I have 8GB of DDR3 RAM, is that enough?

It is interesting, though: I have actually run all the dnn_mmod examples. The dog hipsterizer and find cars examples work fine, but the face detection and find cars 2 examples give the error.

Thank you for your quick answers.

naturalminer said...

Dear Davis, congratulations first of all. I have a question. Please correct me if I am wrong, but as far as I understood, this algorithm detects only one type of object (for example, cars). However, I need to detect more than 5 classes: car, bicycle, truck, pedestrian, and so on. Is it possible to classify and detect more than one object category with the MMOD CNN? If so, it will be very helpful for my project. Thanks for your answer.

Davis King said...

Yes, newer versions support multiple class labels. See http://blog.dlib.net/2017/09/fast-multiclass-object-detection-in.html

Unknown said...

I trained my network with the default dnn_mmod_ex.cpp, just with the batch count lowered to 50. Then, when I take the trained network and try to run it with your face recognition program, I get:

"An error occurred while trying to read the first object from the file mmod_network.dat.
ERROR: Wrong padding_y found while deserializing dlib::con_"

I used imglab to label images of dogs and then used that to train my network. Everything is default. I also tried changing the cropper size back to the default but the result is the same. Any suggestions?

Davis King said...

Those examples have slightly different networks. That's what the error is telling you. You can't load a saved network into a network object that is defined to have a different architecture. If you want to do this you need to change the code so that the networks are all defined the same way.

Unknown said...

Wow ! Amazing work!! Thank you sir!!!
Eagerly awaiting python bindings :o)

Davis King said...

Way ahead of you :), see http://blog.dlib.net/2014/04/dlib-187-released-make-your-own-object.html

Unknown said...

Hi, I studied the examples of face detection and training using HOG. I want to speed up face detection by using only 3 of the 5 detectors from your pretrained model. I was hoping that after deserializing I would get a std::vector, but looking at frontal_face_detector.h it is only one object detector (object_detector<scan_fhog_pyramid<pyramid_down<6>>>).
Is it even possible to use only 3 of the detectors (I want only frontal faces, not side-looking ones)? Or do I have to train my own?
thanks

Davis King said...

Look at the object_detector object. You can pull out whatever parts you want and pack them into a new object_detector. You don't need to retrain. http://dlib.net/dlib/image_processing/object_detector_abstract.h.html
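Concretely, something like the following should work. This is a sketch; which of the 5 components are the frontal ones is an assumption you would want to verify by testing each sub-detector on its own:

#include <vector>
#include <dlib/image_processing/frontal_face_detector.h>
using namespace dlib;

int main()
{
    frontal_face_detector detector = get_frontal_face_detector();

    // Build single-component detectors from a subset of the 5 HOG filters.
    // Which indices correspond to frontal views is an assumption here; check
    // by running each one separately on your images.
    std::vector<frontal_face_detector> parts;
    for (unsigned long i = 0; i < 3 && i < detector.num_detectors(); ++i)
        parts.emplace_back(detector.get_scanner(), detector.get_overlap_tester(), detector.get_w(i));

    // Combine the selected components back into one detector.
    frontal_face_detector smaller_detector(parts);
}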

Unknown said...

Excuse me, I want to know how many layers this DNN face detection model has.

Davis King said...

You can see the network definition in the example program: http://dlib.net/dnn_mmod_face_detection_ex.cpp.html. For instance, there are 7 convolution layers.

Unknown said...

Hello, excuse me, can anyone teach me how to work out how many layers this DNN model has?

AbuShuvom said...

Awesome work, Davis. I just want to know the minimum CUDA requirement for running dlib on a GPU. I am planning to run it on a Jetson TK1, which has a Maxwell architecture (compute capability 3.2) and supports CUDA 6.5.

Davis King said...

Anything with architecture 3.0 or newer should be fine.

AbuShuvom said...

That's awesome!! I am running it on the Jetson TK1 like a charm now, but when a face is being recognized on the live camera feed it gets slower; otherwise it's real-time. I am not sure whether it's using the CPU or the GPU to run the recognition ResNet model.

Anguo Yang said...

Hi Davis,
I have a problem with object detection, and I would really appreciate it if there is a solution using dlib.
I am currently using YOLOv2 and YOLO-DenseNet for object detection, but the results are not good at distinguishing objects with similar or identical shapes (but different colors), even though I tried modifying the data augmentation parameters (hue, exposure, saturation, etc.) and other parameters; the results are still very bad. Perhaps it is because I could not get many images for each class (currently only 8 images per class). It is not easy to get many images per class, as we have more than 60,000 classes!

I remember using the object detector sample in dlib 2 years ago; it is robust to color, although I only used it for single-object detection. Could you please give me some advice on detecting objects with a huge number of classes (at least 10,000)? Thank you so much.

Anonymous said...

Hi, how can I compile it using MinGW?

g++ -std=c++11 -O3 -I.. ../dlib/all/source.cpp -lpthread -lX11 example_program_name.cpp

produces lots of compile errors.

Unknown said...

Hello Davis,
I really like dlib and I have used this library to train on my own data for several months. I have two questions about the DNN face detection.

1. Does the DNN face detector scan the image just like the HOG detector does? The scanning speed of HOG detection can be set by tuning the scanner's parameters, with code something like this:
--
typedef dlib::scan_image_pyramid<...> image_scanner_type;
image_scanner_type scanner;
scanner.copy_configuration ( HoGDetector.get_scanner() );
scanner.set_max_pyramid_levels ( 3 );
--
But what about the DNN detector? I would like to set the scanner's parameters to tune the DNN scanning speed.

2. When I use the DNN detector, can I get the confidence score of each sliding window? Or can I get the detected boxes whose confidence is lower than 0?

Thank you.

Davis King said...

Yes, the DNN scans the image, more or less like HOG. There isn't much you can do to change the speed of the CNN without changing the network and retraining.

Yes, the CNN outputs confidence. Look at the documentation. It's all described in detail.
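For reference, each returned detection is an mmod_rect, which carries the score. A minimal sketch, assuming net and img are set up as in the face detection example; the knob for pulling out boxes scoring below 0 is described in the loss_mmod documentation:

// Each detection carries its score in mmod_rect::detection_confidence.
std::vector<mmod_rect> dets = net(img);
for (auto&& d : dets)
    std::cout << d.rect << "  confidence: " << d.detection_confidence << std::endl;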

Unknown said...

Hi Davis,

I have a question about the sliding window detection with MMOD. Specifically, one of your examples states that the receptive field of the CNN is 50x50 pixels, but the random cropper selects random dimensions in the range 40x40 to 270x270. Are 50x50 windows of the cropped image being passed into the network one at a time, or is it the entire cropped image?

Maybe I have some misunderstanding, but I appreciate your help!

Davis King said...

Entire images are passed in. See http://blog.dlib.net/2017/08/vehicle-detection-with-dlib-195_27.html for more details. Also go run the code and look at the images produced by the random cropper.

Unknown said...

Those visualizations of the image pyramid and the heatmaps are exactly what I was looking for, thanks!

So if I understand correctly: if you had a sliding window of size 40x40, for example, but some of your labeled images had bounding boxes of size 80x80, then the sliding window would detect those larger labels in the lower levels of the image pyramid, and the bounding box would be scaled up accordingly on the output?

And one more question: if the output of the CNN is only one channel, then how is the bounding box information (width, height, coordinates) retrieved to compare to the ground truth and compute the IoU?

Davis King said...

Yes, that's right.

There are multiple output channels to support multiple output box shapes and types.

Pfaeff said...

How long did it take to train on the dataset of ~7000 faces?

Davis King said...

I forget how long it took. Probably under a day on a 1080ti.

Pfaeff said...

Thank you very much!

Ruslan said...

I'm new to dlib and to object detection (faces too). I recently read a couple of the articles here and have some questions:
1. I can't figure out which network is used in dlib.
2. Is there any difference when using other networks with dlib? E.g., if I train a network (SSD+MobileNet, Faster R-CNN with MobileNet/Inception/ResNet, YOLO, etc.), what are the requirements for using it?
3. How do I optimize the network and dlib for accuracy and speed?

The problems I have for now:
1. The detector takes a long time and uses too much CPU.
2. I cannot use the same detector in parallel (threads or TBB tasks, for example).

I will appreciate any help or suggestions.
Thanks in advance.

Davis King said...

The answers to your questions are all discussed in detail in the dlib documentation. In particular, the blog post you are commenting on discusses some of this, with links to additional relevant documentation.

Unknown said...

Hi Davis!

I have a question regarding the SGD training of the CNN.

In the MMOD paper, equation (8) gives the optimization problem at hand. I understand this optimization for linear models of the type shown in equation (3), but what do the model parameters w mean in the case of a CNN? Are they the actual filter weights?

My interpretation is that phi in equation (3) is the CNN and w is the SVM parameters. But if that is the case, how is the CNN trained?

Sorry if this is an unclear question.

Davis King said...

The CNN in this context is f() in equation 3. It's some big function, with a bunch of parameters, that given an image and a location in that image tells you how likely it is there is an object at that location. SGD optimizes over all the parameters of f(). In the paper f() is linear in the parameters, but in the CNN it's not. That's the only difference.
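In symbols, paraphrasing the paper's notation: a labeling y is a set of non-overlapping boxes, and its score is a sum of per-window scores,

\[ F(x,y)\;=\;\sum_{r\in y} f(x,r) \]

where in the paper f(x,r) = \langle w, \phi(x,r)\rangle is linear in the parameters w, while in the CNN version f(x,r) is the network's output at window r and SGD optimizes all of the network's parameters.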
