Monday, February 3, 2014

Dlib 18.6 released: Make your own object detector!

I just posted the next version of dlib, v18.6.  There are a bunch of nice changes, but the most exciting addition is a tool for creating histogram-of-oriented-gradients (HOG) based object detectors.  This is a technique for detecting semi-rigid objects in images which has become a classic computer vision method since its publication in 2005.  In fact, the original HOG paper has been cited over 7000 times which, for those of you who don't follow the academic literature, is a whole lot.

But back to dlib: the new release has a tool that makes training HOG detectors super fast and easy.  For instance, here is an example program that shows how to train a human face detector.  All it needs as input is a set of images and bounding boxes around faces.  On my computer it takes about 6 seconds to do its training using the example face data provided with dlib.  Once finished, it produces a HOG detector capable of detecting faces.  An example of the detector's output on a new image (i.e. one it wasn't trained on) is shown below:


You should compare this to the time it takes to train OpenCV's popular cascaded Haar object detector, which is generally reported to take hours or days to train and requires you to fiddle with false-negative rates and all kinds of spurious parameters.  HOG training is considerably simpler.
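For the curious, here is roughly what the training example boils down to.  This is a condensed, untested sketch of the bundled fhog_object_detector_ex.cpp; the dataset path and parameter values are just illustrative:

#include <dlib/svm_threaded.h>
#include <dlib/gui_widgets.h>
#include <dlib/image_processing.h>
#include <dlib/data_io.h>

using namespace dlib;

int main()
{
    // Load images and face bounding boxes from an imglab-style XML file.
    dlib::array<array2d<unsigned char> > images;
    std::vector<std::vector<rectangle> > boxes;
    load_image_dataset(images, boxes, "faces/training.xml");

    // Upsample once so small faces are big enough for the detection window.
    upsample_image_dataset<pyramid_down<2> >(images, boxes);

    // Scan an 80x80 window over a HOG image pyramid.
    typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;
    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);

    // Train with the structural SVM solver; C is the usual SVM
    // regularization parameter.
    structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);
    trainer.be_verbose();
    object_detector<image_scanner_type> detector = trainer.train(images, boxes);

    // Visualize the learned HOG filter (the face-like image shown below).
    image_window hogwin(draw_fhog(detector), "Learned fHOG detector");
    hogwin.wait_until_closed();
}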

Moreover, the HOG trainer uses dlib's structural SVM based training algorithm which enables it to train on all the sub-windows in every image.  This means you don't have to perform any tedious subsampling or "hard negative mining".  It also means you often don't need that much training data.  In particular, the example program that trains a face detector takes in only 4 images, containing a total of 18 faces.  That is sufficient to produce the HOG detector used above.  The example also shows you how to visualize the learned HOG detector, which in this case looks like:


It looks like a face!  That said, it's worth training on more than 4 images, since labeling and training on at least a few hundred objects doesn't take long and can improve the accuracy.  In particular, I trained a HOG face detector using about 3000 images from the Labeled Faces in the Wild dataset and the training took only about 3 minutes.  3000 is probably excessive, but who cares when training is so fast.

The face detector trained on the Labeled Faces in the Wild data comes with the new version of dlib.  You can see how to use it in this face detection example program.  The underlying detection code in dlib makes use of SSE instructions on Intel CPUs, which lets dlib's HOG detectors run at the same speed as OpenCV's fast cascaded object detectors.  So for something like a 640x480 resolution web camera it's fast enough to run in real time.  As for accuracy, it's easy to get the same detection rate as OpenCV but with thousands of times fewer false alarms.  You can see an example in this YouTube video, which compares OpenCV's face detector to the new HOG face detector in dlib.  The circles are from OpenCV's default face detector and the red squares are from dlib's HOG-based face detector.  The difference is night and day.
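Using the bundled detector takes only a few lines.  Here is a trimmed, untested sketch of the face_detection_ex.cpp example (the image path is illustrative):

#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/image_transforms.h>
#include <dlib/image_io.h>
#include <iostream>

using namespace dlib;

int main()
{
    // The HOG face detector that ships with dlib.
    frontal_face_detector detector = get_frontal_face_detector();
    array2d<unsigned char> img;
    load_image(img, "faces/2007_007763.jpg");
    // Upsample once so smaller faces become detectable.
    pyramid_up(img);
    std::vector<rectangle> dets = detector(img);
    std::cout << "Number of faces detected: " << dets.size() << std::endl;
}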


Finally, here is another fun example.  Before making this post I downloaded 8 images of stop signs from Google Images, drew bounding boxes on them, and then trained a HOG detector.  This is the detector I got after a few seconds of training:


It looks like a stop sign and testing it on a new image works great.


Altogether it took me about 5 minutes to go from not having any data at all to a working stop sign detector.  Not too shabby.  Go try it out yourself.  You can get the new dlib release here :)

238 comments :

Davis King said...

Thanks.

The data is available here: http://dlib.net/files/data/

Mohammad Haghighat said...

Dear Davis,

I am trying to train the face detector without upsampling the images so that I can detect smaller faces.
I tried it on the example file "fhog_object_detector_ex.cpp". I commented out the two lines of upsampling (upsample_image_dataset) and decreased the detection window size to 40x40. However, it is not able to detect most of the faces in the test images. Could you please let me know where my mistake is?
I really appreciate you for creating such an amazing library.

Davis King said...

You probably need more training data. But even more than that, the spatial resolution of HOG is, by default, low: it downsamples the image into 8x8 cells, so a 40x40 pixel window is only 5x5 HOG cells, which is not quite enough to model a face well. So the options are to change the HOG cell size to something smaller, like 4, or to upsample the image so small faces get bigger. Both are going to have about the same computational requirements.
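For example, when setting up the scanner, something like this (untested; see the scan_fhog_pyramid documentation for set_cell_size):

typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;
image_scanner_type scanner;
scanner.set_detection_window_size(40, 40);
// Shrink the HOG cells from the default 8x8 pixels to 4x4, so a 40x40
// window covers 10x10 cells instead of 5x5.
scanner.set_cell_size(4);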

Hongbin Ma said...

Hi Davis, many thanks for your great work. In OpenCV, we can choose to detect only the biggest face when calling the detector. Does dlib provide a similar option to detect only the largest face so as to speed up face detection?

Davis King said...

You can control what pyramid scales the detector runs at. Set the scales to find only big faces if that's what you want to do.
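For example, a quick (untested) way to find only big faces is to downsample the input before scanning and then map the boxes back up:

// img is the input image, e.g. a dlib::array2d<unsigned char>.
// Halve the image so the smallest detectable face doubles in size.
dlib::pyramid_down<2> pyr;
dlib::array2d<unsigned char> small_img;
pyr(img, small_img);
std::vector<dlib::rectangle> dets = detector(small_img);
// Map the detections back onto the original image.
for (unsigned long i = 0; i < dets.size(); ++i)
    dets[i] = pyr.rect_up(dets[i]);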

Hongbin Ma said...

Thanks a lot, Davis. How do I set the pyramid scales the detector runs at directly? Could you please give a code example?

xhwang said...

Hi, Davis. Thanks for the excellent code and examples. I managed to use the frontal_face_detector in a multi-threaded environment, but I have not found out whether it's thread safe. Is it? Best.

Davis King said...

It depends on how you use it. The usual rules about not touching an object from multiple threads at once apply.
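The simplest safe pattern is to give each thread its own copy of the detector, e.g. (untested sketch):

frontal_face_detector detector = get_frontal_face_detector();
// Inside each thread, work on a private copy rather than the shared object:
frontal_face_detector my_copy(detector);
std::vector<dlib::rectangle> dets = my_copy(img);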

Gábor Vecsei said...

Hi!

What is the prediction time for images around 1000x1000?

Tommi said...

Gabor: Probably on the order of 100 ms.

Marium Hashmi said...

I can't seem to understand how to do that, or where to do that.
The examples/faces folder contains some jpg images of people. You can run
this program on them and see the detections by executing the following command:
./face_detection_ex faces/*.jpg
I have successfully compiled dlib.
I am writing it as
D:\Downloads\dlib\examples\build>./face_detection_ex faces/*.jpg

and
D:\Downloads\dlib\examples\build>./face_detection_ex ../faces/*.jpg

and I know I am not doing it right, but I can't seem to get it at all.

Saikrishna Dyavarasetti said...

Hi Davis,

I have a problem compiling face_detection_ex.cpp with arm-linux-gnueabi-g++ on my machine (Ubuntu 16.04 LTS). Can you suggest the command to use without -lX11 support?
And how do I add jpeg support for the ARM compilation?
When I add -DDLIB_JPEG_SUPPORT the error is:

/usr/lib/gcc-cross/arm-linux-gnueabi/5/../../../../arm-linux-gnueabi/bin/ld: cannot find -ljpeg
collect2: error: ld returned 1 exit status

Thanks In advance.

Saikrishna Dyavarasetti said...

Hi Davis,

int main(){
    frontal_face_detector detector = get_frontal_face_detector();
    dlib::array2d<unsigned char> arr1;
    dlib::load_image(arr1, "faces/2007_007763.jpg");
    dlib::pyramid_up(arr1);
    std::vector<dlib::rectangle> dets = detector(arr1);
    cout << "Number of faces detected: " << dets.size() << endl;
}

In the above program, what are the arr1 values and how are they defined? Can we see how load_image works?
And can we give a two-dimensional array of an image in rgb_pixel form to the detector? If possible, how?

Please answer my question; it's important.

Umang Sharma said...

Hey
I am getting this error. How do I solve it?

An impossible set of object labels was detected. This is happening because none
of the object locations checked by the supplied image scanner is a close enough
match to one of the truth boxes. To resolve this you need to either lower the
match_eps or adjust the settings of the image scanner so that it hits this
truth box. Or you could adjust the offending truth rectangle so it can be
matched by the current image scanner. Also, if you are using the
scan_image_pyramid object then you could try using a finer image pyramid or
adding more detection templates. E.g. if one of your existing detection
templates has a matching width/height ratio and smaller area than the offending
rectangle then a finer image pyramid would probably help.

image index 0
match_eps: 0.5
best possible match: 0.485184
truth rect: [(590, 363) (738, 450)]
truth rect width/height: 1.69318
truth rect area: 13112
nearest detection template rect: [(604, 345) (700, 441)]
nearest detection template rect width/height: 1
nearest detection template rect area: 9409

Dante Knowles said...

Is it possible to adjust some sort of threshold and get areas that have some probability of being a face but would otherwise get missed?

I'm currently using FHOG to detect faces, but the new DNN implementation has much higher recall. Unfortunately, it's very slow and memory-hungry on large images.

If I could use FHOG to find all tiles of the image that have some probability of a face and run those tiles through the DNN rather than the whole image, it could possibly increase overall recall and precision.

Thanks in advance Davis.

Davis King said...

Yes, the detector has an optional argument that lets you adjust the detection threshold. This is discussed in the documentation for the object_detector.
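For example (untested sketch):

// A negative adjust_threshold makes the detector more permissive and
// also returns each box's detection confidence.
std::vector<std::pair<double, dlib::rectangle> > dets;
detector(img, dets, -0.5);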

Jan said...

Hi Davis,

Your detector works very nice. It really produces almost 0 false-positives, which is amazing :) Thank you for developing and publishing it.

Right now I'm rotating my test images to be able to detect rotated objects. Unfortunately, since there are many different angles, detection takes a bit more time than I hoped. I believe I can speed up the process if I exclude the blank regions that arise in the corners of the images after rotating them. Unfortunately I can't find the place in the code where I can do that; I mean the place where I can select the regions that will be searched for an object. Maybe you can suggest a function or file to look at?

Tommi said...

Jan, you can give a ROI to the function dlib::evaluate_detectors() in the following way:

dlib::rectangle roi = ...
std::vector<dlib::rectangle> sub_results;
dlib::array2d<unsigned char> sub_image;
dlib::extract_image_chip(image, roi, sub_image);
dlib::evaluate_detectors(detectors_, sub_image, sub_results, minimum_confidence);
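Note that the boxes come back in sub_image coordinates (assuming sub_results holds plain rectangles as above), so shift them back onto the original image afterwards:

for (unsigned long i = 0; i < sub_results.size(); ++i)
    sub_results[i] = dlib::translate_rect(sub_results[i], roi.tl_corner());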

Jan said...

Thanks for the answer Tommi.

I'm now trying to figure out one more way to speed things up. How do I compile dlib with Python API in release mode?

For example to compile it with AVX instructions I can use this line:
python setup.py install --yes USE_AVX_INSTRUCTIONS

but how do I specify that I want to compile it in release mode?

Davis King said...

It's in release mode by default. You don't have to do anything.

Jan said...

Thanks Davis

Paul Bakker said...

Hi Davis,

First of all thank you for the wonderful dlib library. I had a question regarding your "max-margin object detection" paper.

In the paper you post excellent results for MMOD-HOG vs. the baseline HOG. I was wondering if the frontal face detector (in "frontal_face_detector.h") provided with dlib is already trained in the MMOD way, or do I need to train my own face detector with "fhog_object_detector_ex.cpp" first to achieve this superior performance?

Furthermore, I was wondering if you think offloading the gradient computations to a GPU makes sense, or whether the time lost copying memory back and forth to the GPU is too great to make this an attractive option?

Thanks in advance!

Davis King said...

Yes, the default detector that comes with dlib is trained with MMOD.

I'm sure the HOG computation could be sped up with a GPU implementation. I don't think anyone is working on adding that to dlib at the moment though.

Paul Bakker said...

Thanks for your quick answer. For my project I need a very high-speed face and landmark detector (500+ FPS), but it doesn't have to be very accurate (the faces are mostly pretty well visible). There's a good chance I'll start working on a GPU version of the HOG computation.

László Gyimóthi said...

Hi Davis!
I am trying to train my own fhog detector with 65 images (~10 detections each). However, just after loading the images and starting to train my object_detector, it throws the following error and stops running:

"exception thrown!
bad allocation"

Is it likely that I'm running out of memory with this small dataset? The images altogether take only approximately 200MB, and I have 4GB of RAM (I made sure to compile in x64 mode).

Davis King said...

Make sure you aren't upsampling the images in the code a bunch. But it does take a good amount of RAM.

William Grant said...

Hi Davis,
I read in your paper that, compared to other detection methods, you consider all windows (with a given aspect ratio). My question is: why do other methods only consider a subset if it is possible to consider all windows like you do? And isn't it very expensive to compute \phi for every window?

Davis King said...

It's not easy to figure out how to consider all windows efficiently and in a way that makes sense. That's the central innovation of the paper.

William Grant said...

But isn't it still expensive to consider all windows? You consider one window size on every level of the pyramid, right (in the new model with the CNN)?

Davis King said...

Computationally, it's just a matter of running the detector on an image. If that were too expensive then the whole thing would be pointless because the detector would be too slow to be useful once learned.

William Grant said...

Yes, but region proposal methods like Faster R-CNN only run the detection (for example, a CNN) on the proposed regions, right? Whereas MMOD runs it on all possible regions? Don't you have millions of regions, and isn't running a forward pass on millions of regions very computationally expensive?

Davis King said...

I don't know what to tell you. Look at the network architecture and see what it does. It's running convolutionally over the image; there is nothing surprising about its speed. Obviously, if you used region proposals and recomputed things for each proposal then it would be terribly slow. But that's a silly thing to do. I realize lots of people do it, but it's not clear to me why, when it's easy to run an entire image pyramid through a CNN in convolutional mode. That's super fast and covers all scales and box positions.

Phi-Vu Nguyen said...

Hi Davis,

I'm new to dlib. I notice that dlib supports deep learning, but does it support using deep learning for pedestrian detection? If so, could you advise on how to do this? Thanks a lot for your help and for the great dlib!

Best,
phivu
