## Monday, February 3, 2014

### Dlib 18.6 released: Make your own object detector!

I just posted the next version of dlib, v18.6.  There are a bunch of nice changes, but the most exciting addition is a tool for creating histogram-of-oriented-gradient (HOG) based object detectors.  This is a technique for detecting semi-rigid objects in images which has become a classic computer vision method since its publication in 2005.  In fact, the original HOG paper has been cited over 7000 times, which for those of you who don't follow the academic literature, is a whole lot.

But back to dlib, the new release has a tool that makes training HOG detectors super fast and easy.  For instance, here is an example program that shows how to train a human face detector.  All it needs as input is a set of images and bounding boxes around faces.  On my computer it takes about 6 seconds to do its training using the example face data provided with dlib.  Once finished it produces a HOG detector capable of detecting faces.  An example of the detector's output on a new image (i.e. one it wasn't trained on) is shown below:

You should compare this to the time it takes to train OpenCV's popular Haar cascade object detector, which is generally reported to take hours or days to train and requires you to fiddle with false negative rates and all kinds of spurious parameters.  HOG training is considerably simpler.

Moreover, the HOG trainer uses dlib's structural SVM based training algorithm which enables it to train on all the sub-windows in every image.  This means you don't have to perform any tedious subsampling or "hard negative mining".  It also means you often don't need that much training data.  In particular, the example program that trains a face detector takes in only 4 images, containing a total of 18 faces.  That is sufficient to produce the HOG detector used above.  The example also shows you how to visualize the learned HOG detector, which in this case looks like:

It looks like a face!  It should be noted that it's worth training on more than 4 images, since it doesn't take long to label and train on at least a few hundred objects and doing so can improve the accuracy.  In particular, I trained a HOG face detector using about 3000 images from the Labeled Faces in the Wild dataset and the training took only about 3 minutes.  3000 is probably excessive, but who cares when training is so fast.

The face detector which was trained on the Labeled Faces in the Wild data comes with the new version of dlib. You can see how to use it in this face detection example program.  The underlying detection code in dlib will make use of SSE instructions on Intel CPUs and this makes dlib's HOG detectors run at the same speed as OpenCV's fast cascaded object detectors.  So for something like a 640x480 resolution web camera it's fast enough to run in real-time.  As for the accuracy, it's easy to get the same detection rate as OpenCV but with thousands of times fewer false alarms.  You can see an example in this YouTube video which compares OpenCV's face detector to the new HOG face detector in dlib.  The circles are from OpenCV's default face detector and the red squares are dlib's HOG based face detector.  The difference is night and day.

Finally, here is another fun example.  Before making this post I downloaded 8 images of stop signs from Google images, drew bounding boxes on them and then trained a HOG detector.  This is the detector I got after a few seconds of training:

It looks like a stop sign and testing it on a new image works great.

Altogether it took me about 5 minutes to go from not having any data at all to a working stop sign detector.  Not too shabby.  Go try it out yourself.  You can get the new dlib release here :)

Morteza said...

Hi,
first thanks for your good blog.
then
I am trying to make dlib work as an object detector. I followed the instructions, but when I run the face detection sample on the faces folder I get this error:
./a.out examples/faces
exception thrown!
DLIB_JPEG_SUPPORT not #defined: Unable to load image in file 2007_007763.jpg

Do you have any Idea?!

Davis King said...

You need to tell your compiler to #define DLIB_JPEG_SUPPORT and then link to libjpeg. If you aren't sure how to do this then just use cmake to generate your project and it will set it up for you. There are detailed instructions here: http://dlib.net/compile.html.

Cheers,
Davis

P.S. Sorry for the late reply. I didn't see that this comment was posted until just now.

Morteza said...

Yeap, I figured it out too!
Interesting blog.

Unknown said...

Hi,

Thanks for this library. I have the examples built and everything is working well.

How did you perform the real time face detection? The example programs seem to take around 15 - 20 seconds per image to detect the faces.

Davis King said...

It should be much faster. Are you sure you compiled the code with optimizations enabled?

Unknown said...

Wow, you were right. Very impressive. Job well done!

Davis King said...

Thanks :)

Stefanelus said...

Really cool stuff. I tried to train a pedestrian detector using HOG on the INRIAPerson data set. One issue I encountered is with the size of the bounding boxes: it starts to complain about the bounding boxes' aspect ratio. Also, if I add more images for training, the "Test detector (precision,recall,AP): 1 1 1" output changes: recall and AP start to decrease.

Did someone manage to train a pedestrian classifier with success?

Davis King said...

This is a sliding window detector, so you have to make your bounding boxes such that the sliding window can hit them. This means they all have to have more or less the same aspect ratio or you will get that error.
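One way to check for this before training is to look for labeled boxes whose aspect ratio is far from the rest. The following is a minimal sketch using only the C++ standard library. `BoxDims`, `find_ratio_outliers` and the tolerance value are illustrative assumptions, not part of dlib, and the trainer's own acceptance test may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Width and height of one labeled box, in pixels.
struct BoxDims {
    double width;
    double height;
};

// Hypothetical helper (not part of dlib): returns the indices of boxes whose
// width/height ratio differs from the median ratio by more than `tolerance`,
// expressed as a fraction of the median. Assumes every height is positive.
std::vector<std::size_t> find_ratio_outliers(const std::vector<BoxDims>& boxes,
                                             double tolerance) {
    std::vector<std::size_t> outliers;
    if (boxes.empty()) return outliers;

    std::vector<double> ratios;
    ratios.reserve(boxes.size());
    for (const BoxDims& b : boxes) ratios.push_back(b.width / b.height);

    // Median (upper middle element for even counts), computed on a copy so
    // the original order of `ratios` is preserved for index reporting.
    std::vector<double> sorted = ratios;
    const std::size_t mid = sorted.size() / 2;
    std::nth_element(sorted.begin(), sorted.begin() + mid, sorted.end());
    const double median = sorted[mid];

    for (std::size_t i = 0; i < ratios.size(); ++i) {
        if (std::abs(ratios[i] - median) > tolerance * median) {
            outliers.push_back(i);
        }
    }
    return outliers;
}
```

Boxes flagged this way are candidates for relabeling, cropping to the common ratio, or moving to a separate detector.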

Adding more images to training should, as a general rule, not reduce the testing accuracy. I suspect there is something inconsistent with how you are creating your bounding boxes. I would plot the truth boxes you are making on top of the images and make sure they appear on the images where you expect them to appear.

Finally, the trainer considers any part of the image you didn't explicitly label as a person as a false alarm. So it's very important to label all the people. One thing you can do to make labeling all the people easier is to setup ignore boxes. This is where you tell the trainer to just ignore parts of the image you don't want to label. You might do this if you have a whole bunch of pedestrians in a group and you don't want to go to the trouble to label them all individually. So you can put a box around the group and tell the trainer to just ignore that area.

At any rate, you pass the ignore boxes in to the 3rd argument to train(). See: http://dlib.net/dlib/svm/structural_object_detection_trainer_abstract.h.html#train

Stefanelus said...

Many thanks for your feedback, I'll give it a try. In the INRIAPerson dataset the body pose differs from image to image; might this affect the final outcome?

I'm very impressed by dlib. I hope to understand how to train a classifier more effectively; the face detector is working so well. If I can get the same level of accuracy for pedestrians that will be outstanding.

I'll put the dataset online, maybe someone else will use it as well.

Davis King said...

Thanks, I'm glad you like it :)

When using HOG, it's best to cluster the different poses together and train a detector for each pose. That's what I did for the face detector and it gives improved results. However, HOG was originally proposed for pedestrian detection and they just lumped all the poses together and got good results on the INRIA person dataset. So I would try training one detector for everything first and see how well that works.

Stefanelus said...

hey Davis,

it's been a while using dlib, very positive :).

I have a question about HOG and how to evaluate classifiers.

std::vector<object_detector<image_scanner_type> > my_detectors;
my_detectors.push_back(detector);
std::vector<rectangle> dets = evaluate_detectors(my_detectors, images_train[0]);

In that piece of code I can push a series of classifiers and I get back a vector of rectangles. Is there any way to determine which classifier triggered an object detection?

Best regards,
Stefan

Davis King said...

You can do it like this:

std::vector<rect_detection> dets;

Then you call:

evaluate_detectors(detectors,img,dets);

Inside the rect_detection (see also http://dlib.net/dlib/image_processing/object_detector_abstract.h.html#rect_detection) is a weight_index field which tells you which detector the detection came from.

Cheers,
Davis

Stefanelus said...

outstanding, it's really cool.

Unknown said...

Hi Davis,

Thank you very much for the code example and explanation!

I'm trying to train a HOG detector with a traffic light dataset but I'm getting an exception: "An impossible set of object labels was detected. (...)"

My dataset has 10 traffic lights with an aspect ratio of around 0.39 - 0.47.

I tried to change the window size to 20 x 45 (w x h), but it didn't work either. I don't understand the value 6 on line 122, "typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;". Would changing the value of pyramid_down help me? How could I change it?

Do you know what I should do to train a HOG detector on my dataset?

Davis King said...

I'm not sure. What is the full error message that you get?

Unknown said...

Hi Davis,

The full error message is:
exception thrown!
An impossible set of object labels was detected. This is happening because none
of the object locations checked by the supplied image scanner is a close enough
match to one of the truth boxes. To resolve this you need to either lower the
match_eps or adjust the settings of the image scanner so that it hits this
truth box. Or you could adjust the offending truth rectangle so it can be
matched by the current image scanner. Also, if you are using the
scan_image_pyramid object then you could try using a finer image pyramid or
templates has a matching width/height ratio and smaller area than the offending
rectangle then a finer image pyramid would probably help.

image index 0
match_eps: 0.5
best possible match: 0.35
truth rect: [(862, 28) (882, 82)]
truth rect width/height: 0.381818
truth rect area: 1155
nearest detection template rect: [(869, 45) (885, 77)]
nearest detection template rect width/height: 0.515152
nearest detection template rect area: 561
Do you have any clue of what this means?
Thanks again!

Davis King said...

You are using a detection window that is a square (because it says "nearest detection template rect width/height: 0.515152"), however, your traffic lights have a very different aspect ratio (i.e. "truth rect width/height: 0.381818"). So you need to set the size of the sliding window to match the objects you want to detect. This is done by calling scanner.set_detection_window_size() which is shown in the example. Setting it to the average size of your traffic lights will probably work fine.

So try something like this: scanner.set_detection_window_size(32, 80)
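A minimal sketch of one way to arrive at such a window size from the labeled boxes, assuming you want the window to keep the boxes' mean width/height ratio while covering a chosen pixel area. `BoxSize`, `suggest_window` and the target area are hypothetical and not part of dlib.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Width and height of one labeled box, in pixels.
struct BoxSize {
    double width;
    double height;
};

// Hypothetical helper (not a dlib function): returns a {width, height}
// window whose aspect ratio is the mean width/height ratio of the boxes
// and whose area is roughly `target_area` pixels.
// Precondition: `boxes` is non-empty and every box has a positive height.
std::pair<long, long> suggest_window(const std::vector<BoxSize>& boxes,
                                     double target_area) {
    double ratio_sum = 0.0;
    for (const BoxSize& b : boxes) {
        ratio_sum += b.width / b.height;
    }
    const double ratio = ratio_sum / static_cast<double>(boxes.size());

    // Solve width * height = target_area subject to width / height = ratio.
    const double width = std::sqrt(target_area * ratio);
    const double height = std::sqrt(target_area / ratio);
    return {std::lround(width), std::lround(height)};
}
```

For boxes with a ratio of 0.4 and a target area of 2560 pixels this yields a 32 x 80 window, the size suggested above. The result would then be passed to scanner.set_detection_window_size().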

Cheers,
Davis

Unknown said...

I tried as you said and I'm still getting the same error.

"exception thrown!
An impossible set of object labels was detected. This is happening because none
of the object locations checked by the supplied image scanner is a close enough
match to one of the truth boxes. To resolve this you need to either lower the
match_eps or adjust the settings of the image scanner so that it hits this
truth box. Or you could adjust the offending truth rectangle so it can be
matched by the current image scanner. Also, if you are using the
scan_image_pyramid object then you could try using a finer image pyramid or
templates has a matching width/height ratio and smaller area than the offending
rectangle then a finer image pyramid would probably help.

image index 1
match_eps: 0.5
best possible match: 0.174795
truth rect: [(1004, 172) (1014, 200)]
truth rect width/height: 0.37931
truth rect area: 319
nearest detection template rect: [(997, 149) (1021, 221)]
nearest detection template rect width/height: 0.342466
nearest detection template rect area: 1825"

I also tried changing trainer.set_epsilon(0.01); to 0.1 and 0.2, but without any improvement. As I mentioned in my first comment, I don't understand line 122, "typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;". Given what you said in the comments of the code, I think that changing the value 6 could change something. I don't know what to do now. Do you have any other idea that could help me? If you want I can send you my dataset.
Thank you very much for the quick reply and the help!

Davis King said...

This time it's complaining because one of your annotations is only 319 pixels in area while the sliding window is 1825 pixels in area. So it will never be able to match that 319 pixel box.

You have to either remove that small box or upsample the images until it's large enough.
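The numbers in the error log are consistent with an intersection-over-union score. Here is a minimal sketch, assuming that this is how the "best possible match" figure is computed (the thread does not say so explicitly) and that rectangle coordinates are inclusive. `Box`, `area` and `intersection_over_union` are illustrative names, not dlib functions.

```cpp
#include <algorithm>

// Axis-aligned box with inclusive pixel coordinates, mirroring how the
// error message prints rectangles: [(left, top) (right, bottom)].
struct Box {
    long left;
    long top;
    long right;
    long bottom;
};

// Area in pixels; zero for an empty (inverted) box.
long area(const Box& b) {
    if (b.right < b.left || b.bottom < b.top) return 0;
    return (b.right - b.left + 1) * (b.bottom - b.top + 1);
}

// Intersection area divided by union area, in the range [0, 1].
double intersection_over_union(const Box& a, const Box& b) {
    const Box inter{std::max(a.left, b.left), std::max(a.top, b.top),
                    std::min(a.right, b.right), std::min(a.bottom, b.bottom)};
    const long inter_area = area(inter);
    const long union_area = area(a) + area(b) - inter_area;
    return union_area > 0
               ? static_cast<double>(inter_area) / static_cast<double>(union_area)
               : 0.0;
}
```

For the truth rect [(1004, 172) (1014, 200)] and the template rect [(997, 149) (1021, 221)] this gives 319 / 1825, about 0.1748, which matches the logged value and is well below the match_eps of 0.5. A box lying entirely inside a window can never score higher than the ratio of their areas, which is why the small box has to be removed or enlarged.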

Don't worry about the 6 in the pyramid_down code. That's not the problem here. Although, if you want to see what happens when you change that number then just change it and see what happens :). But 6 will work fine for most applications.

Cheers,
Davis
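For readers curious about the 6 in pyramid_down<6>: to my understanding, dlib documents pyramid_down<N> as downsampling at a ratio of N to N-1, so each pyramid level is 5/6 the size of the previous one. The sketch below counts how many levels a fixed window still fits into under that assumption. `count_pyramid_levels` is a hypothetical helper that works on real-valued sizes, so the counts from dlib's integer-sized images may differ slightly.

```cpp
// Hypothetical helper (not part of dlib): counts the pyramid levels at which
// a window of window_w x window_h still fits inside the image, assuming each
// level is (n - 1) / n the size of the previous one.
int count_pyramid_levels(double image_w, double image_h,
                         double window_w, double window_h, int n) {
    // Guard against arguments that would make the loop meaningless or endless.
    if (n < 2 || window_w <= 0.0 || window_h <= 0.0) return 0;

    const double step = static_cast<double>(n - 1) / static_cast<double>(n);
    int levels = 0;
    while (image_w >= window_w && image_h >= window_h) {
        ++levels;
        image_w *= step;
        image_h *= step;
    }
    return levels;
}
```

With a 640x480 image and an 80x80 window, n = 6 gives 10 levels while n = 2 gives only 3. A larger N therefore means more, finer-grained scales to search, at a proportional cost in time.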

Stefanelus said...

Hi Davis,

I'm trying to compile the library for the ARM architecture. I'm a bit new to Linux and ARM. Is it possible?

On a Linux box I tried with this compiler: arm-linux-gnueabi-g++.

Any advice when it comes to ARM?

Stefan

Davis King said...

It shouldn't be any different from compiling it on a Intel machine. What happened when you tried to compile it?

Stefanelus said...

I managed to compile it. I installed X11 and it compiled on ARM.

Best regards,
Stefan

Stefanelus said...

Hi Davis,

Regarding ARM, I'm trying to compile dlib without the GUI support, when I try to do so I get some errors

arm-linux-gnueabi-g++ -O3 -I.. ../dlib/all/source.cpp -lpthread file_ex.cpp

Here are some of them; I think these errors are caused by X11:

/tmp/cc4XTeVy.o: In function dlib::base_window::hide()':
source.cpp:(.text+0xe100): undefined reference to XUnmapWindow'
source.cpp:(.text+0xe10c): undefined reference to XFlush'

There is this preprocessor define, which I think should exclude the UI from the compilation process:

#define DLIB_NO_GUI_SUPPORT

This is just like the DLIB_ISO_CPP_ONLY option except that it excludes only the GUI part of the library. An example of when you might want to use this would be if you don't need GUI support and you are building on a UNIX platform that doesn't have the X11 headers installed.

Am I missing something here?

Stefan

Davis King said...

If you add -DDLIB_NO_GUI_SUPPORT to your command line options it should build properly.

Stefanelus said...

many thanks, it worked.

Richy B said...

Is there any way of finding out from the shape predictor (especially in relation to the 68 point face model) the number of actual "detected" points vs the "predicted" points (i.e. some sort of "confidence" value)?

Davis King said...

That algorithm doesn't have any kind of detection confidence. You will have to use some additional tool to estimate the accuracy, for example, by training a binary SVM to recognize when the points are very accurate.

Richy B said...

Many thanks Davis for your reply. On a similar note, I notice that frontal_face_detector.h is "trained on mirrored set of labeled_faces_in_the_wild/frontal_faces.xml": obviously, this is the LFW image set data from http://vis-www.cs.umass.edu/lfw/ , but where did you get the frontal_faces/left_faces.xml files from? (Basically, I need to build my own facial HOG detector, but with a smaller detection window)

Davis King said...

I manually labeled side views of faces in LFW.

Unknown said...

Hi,
I have been testing several face detectors and I must say that yours is the best, quite impressive. You could write a publication about it :)

I have a question about merging several HOG detectors into one. I found that you trained 5 HOG filters, which are then used as one.
How did you merge all those filters into one?

Davis King said...

Yeah, I have a paper that describes it that I'm trying to publish. It's a tedious process.

The object_detector has a constructor that takes a std::vector of object_detectors and combines them into one. They must all share the same settings (e.g. window size) though.

This comment has been removed by the author.

Really cool stuff! About the paper on the detector-merging process, do you mean <> on arXiv :)

Unknown said...

Does face detection of dlib depend on threads?
I defined DLIB_ISO_CPP_ONLY and included frontal_face_detector.h, but got the error message: "DLIB_ISO_CPP_ONLY is defined so you can't use this OS dependent code."

Is it possible to run face detection with only iso_cpp ?

Davis King said...

The face detector doesn't really depend on threading but it eventually #includes headers that do. Don't worry about it. Just compile it normally and you will be fine.

Unknown said...
This comment has been removed by the author.
Davis King said...

There is no leak. It's just that at program shutdown it doesn't wait until those threads have terminated before allowing the program to finally close because it doesn't matter (the OS reclaims them after the process terminates regardless). Moreover, in some cases waiting for them can cause annoying delays between main() terminating and the program actually ending. So dlib does not wait.

Unknown said...

Thank you.
I just found that Microsoft agreed about the false positives, as follows:
https://msdn.microsoft.com/en-us/library/x98tx3cf.aspx

Anonymous said...

I am trying to use this library. I get an error something like this: undefined reference to dlib::image_window::~image_window()' Dlib_try.cpp

Davis King said...

Anonymous said...

Already tried using that but no success. What else should I check for ?

Davis King said...

That's definitely the problem. That function is in dlib/all/source.cpp and the error is saying you didn't add it to your project. If you don't know how to add source files to your project then use cmake to set it up. There are instructions here: http://dlib.net/compile.html

Anonymous said...

I added the source.cpp to the src folder of my project. Now I get the error :
undefined reference to XDestroyWindow'

Davis King said...

Link to X11. There are instructions here for all of this: http://dlib.net/compile.html

Richy B said...

In my project (Centos) based, I had the following in CmakeLists.txt:
include( ../../../../dlib/cmake)

#include "../../../../dlib/image_io.h"
#include "../../../../dlib/revision.h"

Hope it helps! Bear in mind, this was my first ever C++ project (I did study "C" over 15 years ago and did a 1 week course in C#.Net over 10 years ago) and following the example code led me to this - so it can't be too difficult to get working.

Unknown said...

Davis, hello and first thanks for the library! I have been working with it for my project and it's really cool! Currently I'm making a pedestrian detector with my own dataset using dlib and opencv and it's working fine in a static position of the camera.

The question I have now is why it isn't running smoothly (like in the example video you put here)? Mine runs at about 4 FPS, and I have enabled SSE2, SSE4 and AVX in the cmake GUI. Also, the test video is 640x480 and I am using a detection window of 70x100. Or is it that my PC can't handle more than that speed?

Davis King said...

I would expect it to be faster than that unless your computer is quite slow. Also, are you sure you enabled compiler optimizations (see dlib.net/faq.html#Whyisdlibslow)?

Unknown said...

Hi,

Thanks for the reply. I think yes, I have them enabled like in the example, then configure and generate. I'm using an i5 2.6GHz x4 with Ubuntu, not sure if that is enough.

It's strange, because when i make the program to show the rectangle of just the first detection he finds, it works okay, but when i tell it to draw all the rectangles it becomes sluggish, even if i use a video with only one person in there.

Unknown said...

Well yes, you were right, something is wrong with the compiler or with cmake. I have an older exe that was working fairly well; now I'm compiling the same code again but the result is slow. I'm puzzled about where the error is, maybe I clicked something. To compile I'm using cmake --build . --config Release

Unknown said...

nvm, I'm dumb. I had misclicked and checked DLIB_ENABLE_ASSERTS in the CMake GUI. What does that do? Sorry for making a new post instead of editing an older one, but I don't know how to do that.

Anyway, thanks for all! This is a great blog.

Davis King said...

No worries

Anonymous said...

How do I define DLIB_JPEG_SUPPORT and DLIB_PNG_SUPPORT without making the use of CMake. I want to use jpeg and png files in programs using Dlib libraries. I could not follow the instructions given in your blog.
Kindly Help.!

Davis King said...

You set it via a compiler switch. For example, with gcc it's -DDLIB_JPEG_SUPPORT.

However, it varies from compiler to compiler. Look up your compiler in google to find out what to do.

Anonymous said...

How do I do that when I want to compile codes using eclipse CDT.

Davis King said...

I have no idea, but I'm sure you can find out using google. Alternatively, you can ask cmake to generate an eclipse project.

vaibhav06891 said...

Any help to deal with error:

I have already linked jpeg libraries . What else should be done.

Davis King said...

There are detailed instructions telling you exactly what to do here http://dlib.net/compile.html. I would recommend using the cmake version of the compile process.

Unknown said...

Hi All
I also have this error "You must #define DLIB_JPEG_SUPPORT and link to libjpeg to read JPEG files."
My OS and IDE are Windows and Eclipse with mingw
Before using dlib, I used cmake to link the dlib library, and also enabled the option "link libjpeg".
But my program (using dlib's face detection example) still produces the error.
How can I fix it? Thanks.

Davis King said...

There are detailed instructions here: http://dlib.net/compile.html

Unknown said...

Hi Davis,
Thanks for your help. I have found my mistake. I was missing the "#define DLIB_JPEG_SUPPORT" at the top of my program.

mohanraj said...

Dear Sir,

I am trying to configure the DLIB library in Visual Studio, but i am getting errors.

Tell me the procedure to configure DLIB in visual studio

Davis King said...

There are detailed instructions here: http://dlib.net/compile.html

Hi Davis,

Thanks for the awesome library.

Does the fhog detection example use just one HOG template? I would like to detect cars, but they have some pose and type variability. If it uses just one HOG template, wouldn't it be a good idea to cluster the images automatically into similar groups (in HOG feature space) and then create a different detector for each of the groups? Are you planning to do that in the future if it isn't already done?

Davis King said...

No problem.

Yeah, it's just one HOG template. If you want to detect things with pose variation you will need to train separate templates for each pose. I've tried a few different things to do automatic pose clustering but none of them work very well and in the end I always go back to manually creating the pose clusters.

Unknown said...

Hi Davis,
I am planning to implement a HOG detector based on RGBD data for people torso detection under different poses. According to the Dalal and Triggs paper, they compute the gradient at a pixel as the max of the three gradients computed over the R, G, and B channels respectively. I am planning to add the depth channel to that as well and then train my detector. Can you help me out with where (which file) I should make the necessary edits so that I can get it working? Thanks for the great library!

Davis King said...

The scan_fhog_pyramid's second template argument defines the feature extractor. By default, it uses this http://dlib.net/dlib/image_processing/scan_fhog_pyramid_abstract.h.html#default_fhog_feature_extractor. If you want to change that then read the documentation there as it tells you what to do.

Unknown said...

Hi, I'm wondering how you selected the images for the current face detector from the Labeled Faces in the Wild dataset. Right now it has more than 5000 images. Did you exclude the non-frontal images? Thank you!

Davis King said...

I manually picked a set of images that seemed appropriate.

Unknown said...

It's a great library, thanks! Just a question about the face detector: Is there a way to test the face-likelihood of a single window? (I mean instead of searching the whole image) I'd like to try a particle filter based face tracker, and for that it would be a help.

Davis King said...

The simplest thing is to just run the detector on a cropped image.

Unknown said...

Ok, but then I need to set the minimal detection window to the cropped image size. Is there a simple way to do it?

Jesse Nicholson said...

Oh my goodness, I don't understand why people are trying out such a complex application when they can't work a compiler. I applaud your patience. Just wanted to say that, and say thanks for the article.

Unknown said...

Hi,

I'd like to use dlib's object detector for multiview face detection. I was wondering if anybody has already tried it, and if so, how good is dlib for this task?
How many examples at least do I need to have a decent result?

Davis King said...

You just train detectors for each of the views. This blog post and the example programs discuss the details.

Unknown said...

hi i downloaded the file so what should i do then by the way does this program really can tell
who the person is cause i have a picture of somebody i just know she is hip hop dancer and she is in dancin stars group i just want know her name

udaya said...

Hello sir,

i've executed the dlib library to detect faces in linux machine....for few images,its not detecting the face..

Clear face photos also detected as "no face"

if you want me to send those images which is not detectiong,i can send...

can u just give me your mail id?

Unknown said...

Hi Davis,

I'm reading your MMOD paper. I have one question regarding the objective defined in equation 8. Why should we add delta(y,y_i)? This term becomes larger for bad y's, when the scoring function F(x_i,y) is actually smaller. Would it work against the scoring function F this way? Or is this to guarantee a larger margin? Thank you.

Yan

Davis King said...

It's explained in the following paragraph.

Maciek Strömich said...

Hi,

I started to play with dlib in Python on my Mac OS X and encountered some issues with compilation against MacPorts version of python2.7 and decided to write down all small fixes. You can check it here. If you know a better way to pass PYTHONLIB:PATH to cmake then feel free to contact me and I will happily update the post with credits :-)

https://suff.wordpress.com/2015/07/30/mac-os-x-python-ports-and-dlib-compilation/ Maybe someone will find it useful.

Davis King said...

CMake should compile against whatever version of python is in your path. So when you type python at the prompt, whatever version that runs is the one it will compile against. It looks like you compiled and then switched python versions right after. Try compiling after you source a different python environment.

Maciek Strömich said...

Davis, thanks for the tip but sadly it's not working as it should.

suff@airtrufla ~/Downloads $ virtualenv-2.7 venv
New python executable in venv/bin/python
Installing setuptools, pip...done.
suff@airtrufla ~/Downloads $ source venv/bin/activate
(venv)suff@airtrufla ~/Downloads $ cd dlib-18.16
(venv)suff@airtrufla ~/Downloads/dlib-18.16 $ cd python_examples/
[...]
-- Found the following Boost libraries:
-- python
-- Found PythonLibs: /usr/lib/libpython2.7.dylib (found suitable version "2.7.6", minimum required is "2.6")
[...]

based on what you wrote, CMake should take the virtualenv version of Python as it's the one in path. The output of cmake ../../tools/python denies this and after compiling it's segfaulting as before.

Davis King said...

Put a cmake message() call into dlib/add_python_module that outputs the CMAKE_PREFIX_PATH variable. Do this at line 37. It should have the path of the correct python version. What do you see on your machine?

Unknown said...

Davis King said...
It's explained in the following paragraph.

Hi Davis，

I understand what delta means. I don't understand why we add delta(y,y_i) instead of subtracting it, since F is the gain and delta is the loss. It seems a bit confusing to me to add these two terms together.

Thank you so much.

Yan

Unknown said...

Blogger Tobias Szesny said...

Hi Davis,

do you think it's possible to train a detector for open doors?
If so, how would I go about defining the truth boxes if most of the time I can only see the left and right edges of the open door?

What aspect ratio can I use for the sliding window?

Greetings
Tobias

August 12, 2015 at 8:30 AM

Davis King said...

I would expect it to work well. Use a box that is the size and aspect ratio of the entire door frame.

Tommi said...

How do you easily annotate objects with a constant aspect ratio? I'm using imglab with train_object_detector example, read the Help-section of imglab but it did not tell how.

Thanks!

Unknown said...

Hi Davis

Have you published the paper about merging several detectors?
I am interested about it, and really want to know how to implement.

Davis King said...

No, all the code does is run each detector and output the combined results. It's far too trivial to publish.
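Taken at face value, "run each detector and output the combined results" can be sketched as below. This is an illustration in plain C++, not dlib's implementation: `Image`, `Rect`, `Detector`, `TaggedDetection` and `run_all` are hypothetical names, and any additional processing the library may perform is out of scope here. The `detector_index` field plays the role that weight_index is described as playing in rect_detection earlier in the thread.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Placeholder image type; a real program would use its library's image class.
struct Image {
    int width;
    int height;
};

// Inclusive pixel rectangle.
struct Rect {
    long left;
    long top;
    long right;
    long bottom;
};

// One detection plus the index of the detector that produced it.
struct TaggedDetection {
    Rect rect;
    std::size_t detector_index;
};

// Any callable that maps an image to a list of rectangles.
using Detector = std::function<std::vector<Rect>(const Image&)>;

// Runs every detector on the image and concatenates the results, recording
// which detector each rectangle came from.
std::vector<TaggedDetection> run_all(const std::vector<Detector>& detectors,
                                     const Image& img) {
    std::vector<TaggedDetection> out;
    for (std::size_t i = 0; i < detectors.size(); ++i) {
        const std::vector<Rect> found = detectors[i](img);
        for (const Rect& r : found) {
            out.push_back({r, i});
        }
    }
    return out;
}
```

Keeping the index alongside each rectangle is what lets a caller tell, for example, a frontal-face detection apart from a side-view one.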

Unknown said...

Hi Davis,

I am trying to train my detector to detect the Eiffel Tower.
To do so I took 10 images from google and started the training.
In the code I changed

scanner.set_detection_window_size(80, 80);

to

scanner.set_detection_window_size(50, 100);

According to the comments in the code, the program should stop if the risk gap is lower than 0.01. But for some reason my trainer is already at iter: 119 with a risk gap of 0.00893032 and it still continues.
Any idea why it doesn't stop?

Thanks in advance for this great library!

Rimac

Unknown said...

[Update]
Ten seconds after my comment it stopped the training.
It worked great with the test images!

Unknown said...

Hi Davis,
Firstly, thank you for your library! It has helped me a lot in my research project.

I read through your paper on "Max-Margin Object Detection" and could not understand the following things:

1. How do you obtain the valid labeling for the training set images (annotated using, say, imglab)? We don't know the 'w' vector, so the scoring function is not defined yet. How do you then assign the labels?
2. The size of the detection window is the size of each rectangle r belonging to Y, isn't it?
3. Do you fix the number of rectangles in a set? I mean, is there a minimum limit?

Davis King said...

I'm not sure I understand your questions. When you label images you just draw boxes on whatever objects you want to detect using a computer mouse and the imglab program. Then that's the input to the training. The whole point of the machine learning method is to find w based on the training data.

Unknown said...

When we train, the input is of the form (x_i, y_i), x_i being the image and y_i being a valid label that satisfies F(x_i, y_i) > max F(x_i, y) over all y != y_i.

The question is: how do we find the valid label that satisfies this condition? According to the paper, the label y_i is a set of non-overlapping rectangles that maximize the scoring function. Since we mark only a few rectangles in every image (using imglab), how does the algorithm determine this label y_i?

Unknown said...

Is it that when we label using imglab, we are actually selecting some rectangles, and these rectangles themselves constitute y_i for each image? And the algorithm forces us to use the same aspect ratio while drawing boxes; why is that important?

I am sorry I am asking so many questions but I am planning to write a research paper and this is the only thing that's not clear to me.

Davis King said...

Yes, that's the idea. You are telling it the rectangles that it should output. They should have the same aspect ratio because the algorithm scans a box over the image and obviously if your rectangles have a very different aspect ratio than the scanning window then it can't ever match your rectangles.

Unknown said...

But in the Loss Augmented Detection Algorithm/Object Detection Algorithm, we try to estimate y* by taking a set D of rectangles (D1, D2, D3, ...) which additionally satisfy some conditions. How are these rectangles selected?

Because if all possible sets of rectangles with varying aspect ratios within an image are considered, the number becomes huge, and picking D out of this huge set by checking the conditions becomes computationally intensive!

Do you select a rectangle with the given aspect ratio and shift it regularly in the x and y directions to get the complete set of rectangles, and then choose D from this set?

Davis King said...

Yes, that's what it does. Did you read the paper? It must say "sliding window" at least a dozen times.
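
The sliding-window idea confirmed here can be sketched as follows. This is a minimal Python illustration, not dlib's scanner: `sliding_windows` is a hypothetical helper, and the pixel stride is an assumed parameter. Objects of other sizes would be covered by repeating the same scan over rescaled copies of the image.

```python
def sliding_windows(image_width, image_height, window_width, window_height, stride):
    """Yield (left, top, right, bottom) for every window position in the image.

    The window keeps a fixed size (and therefore a fixed aspect ratio) and is
    shifted by `stride` pixels in x and y. Right and bottom are exclusive.
    """
    for top in range(0, image_height - window_height + 1, stride):
        for left in range(0, image_width - window_width + 1, stride):
            yield (left, top, left + window_width, top + window_height)


# A 50x100 window (the size used for the Eiffel Tower earlier in the thread)
# scanned over a 200x200 image in 10 pixel steps.
windows = list(sliding_windows(200, 200, 50, 100, stride=10))
```

Because the candidate set is fixed by the window size and stride, choosing the best set D only has to score these positions rather than every possible rectangle.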

Unknown said...

Hi Davis,

Dlib has been very useful to us in Face Landmark Points detection, thank you for that!

We are already using the shape model provided for our application, but we have been trying to train a new model using our own dataset of approximately 65000 images.

With an oversampling factor of 30 and nu=0.2, Dlib required around 100 hours to finish the training. But due to poor results, we tried with higher values. With oversampling at 100 and nu=0.3, Dlib stated that it requires 81000 hours to complete. We let it run for 2 days in the hope that the estimated time might change but that did not happen. Now, trying with oversampling at 50 and nu=0.3, it shows 42 hours remaining.

Could you please shed some light on how the remaining time is calculated?

Are there any other parameters that we should tweak which might help us here?

We are running this on a machine with 8 cores, 16GB RAM and a 3.5GHz processor. Do you have any suggestions on the specifications of the machine used?

Any other tips for training a model would be very helpful!

Thank you.

Davis King said...

You probably ran out of RAM and your computer started swapping, thus making the program much slower.

If it's not working with that much data then there is probably a systematic labeling problem in your data. Maybe it's not labeled very consistently? Or maybe it's too hard for it to learn.

Unknown said...

You were probably right that it was too hard for it to learn. I tried again after removing the extreme cases from my dataset and the time remaining is 98 hours this time. Hopefully, it will complete the training this time.

juan said...

Hello Davis, I want to thank you and congratulate you for your awesome work, the library is truly amazing.

I am experimenting with the frontal_face_detector that comes with the library (the one you get by calling get_frontal_face_detector()) and I was wondering if it is possible to set the detection window size to smaller values, since I can't find a way to access the detector in frontal_face_detector.

Cheers!

Davis King said...

Thanks, I'm glad you like it :)

The detector looks for all faces larger than about 80x80 pixels in size. If you want to find smaller faces you have to upsample the image.
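
As a rough guide to how much upsampling that implies, here is a small Python sketch. It assumes each upsampling step doubles the image size and takes the 80 pixel figure from the reply above; `upsamples_needed` is a hypothetical helper, not a dlib function.

```python
def upsamples_needed(smallest_face_px, window_px=80):
    """Return how many 2x upsamplings make a face at least `window_px` wide."""
    if smallest_face_px <= 0:
        raise ValueError("face size must be positive")
    count = 0
    size = smallest_face_px
    while size < window_px:
        size *= 2  # assumed: one upsampling step doubles the image
        count += 1
    return count


# A face about 25 pixels wide needs two doublings (25 -> 50 -> 100).
needed = upsamples_needed(25)
```

Each doubling also quadruples the number of pixels to scan, so detection time grows accordingly.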

juan said...

Thanks Davis, I'll try that :)

Unknown said...

Hello Davis

I tried the frontal face detection and it was impressive. Now I want to train my own detector. I labelled some 40 images in the LFW database and used them as the training set, but the detection results are disappointing. I guess it's partially caused by my labelling. When labelling the faces, should the bounding boxes be square? Is there anything else to keep in mind when labelling?

Davis King said...

You need to label a lot more faces than 40. At least hundreds and preferably a few thousand.

Unknown said...

Hello Davis,

could you please provide some information about how impossible truth boxes are calculated?

I tracked down your code to the part where 'area' and 'total_area' are calculated.
(Link to this part of code: http://pastebin.com/rBqmeRrd)

I don't understand how the Rects in mapped_Rects[] are calculated.

My general goal is to automatically adjust one or more parameters so that this exception won't happen anymore.

Davis King said...

A human supplies the truth boxes, they are not calculated. You just have to make sure they are all more or less of the same aspect ratio and not smaller than the detection window and you will be fine.

Also, this issue is explained in detail at the end of this example: http://dlib.net/fhog_object_detector_ex.cpp.html. You should read that example in its entirety.

Unknown said...

I think you misunderstood my question.
I am not talking about the truth boxes themselves. I know that these are drawn by humans. What I would like to know is how the program detects impossible truth boxes.

Thank you again! :)
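
One way to picture such a check is as an overlap test: a truth box is "impossible" if no scaled copy of the detection window can overlap it well enough. The Python sketch below is only an illustration under stated assumptions and is not dlib's implementation. It ignores window stride and HOG cell quantization, compares boxes that share a center, and uses a 6/5 scale step and a 0.5 overlap threshold to mirror the pyramid and matching eps values mentioned elsewhere in this thread. All function names are hypothetical.

```python
def centered_iou(w1, h1, w2, h2):
    """Intersection over union of two rectangles sharing the same center."""
    intersection = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union


def best_window_iou(box_w, box_h, win_w, win_h, scale_step=6 / 5, levels=20):
    """Best IoU that any scaled copy of the detection window reaches with the box."""
    best = 0.0
    scale = 1.0
    for _ in range(levels):
        best = max(best, centered_iou(box_w, box_h, win_w * scale, win_h * scale))
        scale *= scale_step  # assumed pyramid step between window sizes
    return best


def is_obtainable(box_w, box_h, win_w, win_h, match_eps=0.5):
    """A truth box counts as obtainable if some window overlaps it enough."""
    return best_window_iou(box_w, box_h, win_w, win_h) > match_eps


# With an 80x80 window: a 100x100 box can be matched, a tall 40x160 box cannot,
# and neither can a 40x40 box that is smaller than the window.
results = [is_obtainable(w, h, 80, 80) for (w, h) in [(100, 100), (40, 160), (40, 40)]]
```

This reproduces the two practical rules given above: keep the boxes close to the window's aspect ratio, and do not label boxes smaller than the detection window.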

Unknown said...

Hello Davis,

I tried the code for detecting electric insulators, which are not square-shaped like a face or a stop sign. The ratio of the dimensions is nearly length:width = 5:1. When I train on objects in a horizontal position, it can detect other horizontal insulators with reasonably good accuracy. But when I train on objects rotated by 45 degrees and test on images with the same 45 degree rotation, the accuracy is not good.
I observed that when I rotate the object by around 45 degrees, a large part of the manually labeled truth box covers non-insulator area. So the training images contain some portion of non-insulator. Do you think that is why the accuracy is poor?
Do you think the code can be used to detect objects of any dimensions, or is it suitable only for objects that are close to square?

Davis King said...

It's going to work the best for semi-rigid objects that occupy a majority of the box. So something that is a thin diagonal line isn't going to work so well. I would train only on vertical objects and rotate the image multiple times during training to find other angles. That should give good accuracy.

Unknown said...

Thanks a lot for the reply with suggestion.
I have some confusion about the suggestion. Should I rotate the image multiple times during training? Or should I train with horizontal insulators only and, during testing, apply the trained horizontal detector to differently rotated versions of the test images? It would be a great help if you could clarify a little more.

Thanks again for your kind help.

Davis King said...

I mean you rotate only during test time.
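
Rotating at test time means running the detector on several rotated copies of the image and mapping each hit back to the original frame. The coordinate bookkeeping can be sketched in Python as below. It assumes the image is rotated about its center and that the rotated copy keeps the same center; `candidate_angles`, `rotate_point`, and `to_original_coords` are hypothetical helpers, and the actual image rotation and detection calls are left out.

```python
import math


def candidate_angles(step_degrees):
    """Angles (in degrees) at which to rotate the test image."""
    return list(range(0, 360, step_degrees))


def rotate_point(x, y, cx, cy, degrees):
    """Rotate (x, y) about (cx, cy) using the standard 2D rotation matrix."""
    theta = math.radians(degrees)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(theta) - dy * math.sin(theta),
            cy + dx * math.sin(theta) + dy * math.cos(theta))


def to_original_coords(center, image_center, applied_degrees):
    """Map a detection center found in a rotated image back to the original."""
    return rotate_point(center[0], center[1],
                        image_center[0], image_center[1], -applied_degrees)


angles = candidate_angles(45)
# A point rotated by 45 degrees and then mapped back lands where it started.
moved = rotate_point(30, 40, 50, 50, 45)
restored = to_original_coords(moved, (50, 50), 45)
```

A smaller angle step finds more orientations at the cost of one detector pass per angle.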

Unknown said...

Hello Davis,

I've created my own dataset for pedestrians, taking roughly 10 images from the INRIA person database and marking the annotations with imglab. I continued to use "fhog_object_detector_ex.cpp" since I saw your comment about it being good for semi-rigid objects that occupy the majority of the box, and I have had to use "remove_unobtainable_rectangles(trainer, images_train, face_boxes_train);". I have already changed the sliding detector to (30,80), which is roughly the aspect ratio of my rectangles, and I believe that MOST of my rectangles should not show up as "impossible truth boxes".

However, here comes the problem, after the training was complete, it could not detect a single pedestrian from my testing.xml. A thing I noticed was that while running the program using the faces example, the risk was mostly in the region of 0.00XXXX. However, when running the program using my training dataset, the risk was in the region of 2.1XXX. Would you happen to know how I can improve on it?

On a side note, I have thought of increasing the sample size to maybe 3000 like you did in your blog post, but do all 3000 annotations have to be done manually via imglab? Or is there a more time-efficient way of doing it?

Thank you so much in advance for your help, I would never have gotten this far without your blog and my non-existent programming knowledge.

Unknown said...

Thanks for the clarification.

tzutalin said...

Hi all,
If you have an interest in HOG detector on Android platform, please feel free to fork and develop together. https://github.com/tzutalin/dlib-android-app

If someone has trained a model for person or pedestrian detection, could you share your model with me? Thanks in advance.

Unknown said...

Hi, Davis King,

In the dlib library, is there a function for face recognition?
Could you give me a demo or some advice?

Davis King said...

No. You must train your own model. You can get reasonable face recognition results using http://dlib.net/ml.html#compute_lda_transform and http://dlib.net/imaging.html#extract_highdim_face_lbp_descriptors

Rasel said...

Hi Davis,

Thank you for a nice library and for your nice post. I am trying to train a model for traffic sign detection. The aspect ratio of the truth rects ranges from 1.0 to 1.20. The dimensions of the traffic signs range from 16x16 to 128x128. For training, I am using a detection window of 16x16. The scan_image_pyramid parameter is 6, and the matching eps is 0.5.

I ran the training on around 500 images and did not do any upsampling. But I got an exception, because for almost every image I am getting unobtainable rectangles. Lowering the matching eps works, but I am afraid it will affect recall.

Any suggestion about what can be wrong?

iOS Developer said...

I am trying to install the dlib C++ library in Xcode for iOS, but I am getting many linker errors. Can anyone tell me how to use dlib in Xcode for iOS? My approach was to drag and drop the entire dlib folder into the Xcode project. I also tried to set the header paths, but it's not working.

Rasel said...

Hi Davis,

Is there any way to feed false positive detections back in as negative examples, i.e. bootstrapping?

Davis King said...

You don't need to do anything. The algorithm already considers all image areas that aren't annotated as an object as a negative example.

Kevin Wood said...

Thanks for your patience in addressing all of our questions, Davis. Here I am, back with another. Hopefully this one is easy. :)

I'm using your object detection on ARM, and without being able to use SSE and other instruction sets for optimization, the detector built from the example runs pretty slowly. While I'm wrapping my head around the detection algorithm and the impact of each of the parameters to the trainer, can you suggest which parameters I can tune to speed up detection? Which would have the biggest impact? Many thanks!

Anguo Yang said...

Hi, Davis,

dlib face detection is very accurate, especially when using pyramid_up on images. However, it does not support the GPU, so it is much slower than OpenCV when detecting faces in HD (say, 1080p) images: more than 1 second for 1 face! Even building on OpenCV, it is not simple to make dlib support the GPU. Could you please give me advice on how to improve the performance of HD face detection with dlib? Are there any other ways besides using a GPU?

Thanks a lot.

Unknown said...

Hi,
how can I pass a set of ignore boxes to trainer.train() as a third argument in fhog_object_detector?

Tommi said...

Rahul: load_image_dataset returns the ignore boxes, which you can directly feed to the trainer:

detector = trainer.train(images, object_locations, ignore_boxes);
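
To make the shape of that data concrete, here is a Python sketch that separates truth boxes from ignore boxes in an imglab-style XML file. It assumes ignored boxes carry an `ignore='1'` attribute; the sample XML and the `split_boxes` helper are hypothetical and only illustrate the one-list-of-boxes-per-image layout, not dlib's loader.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of an imglab dataset file.
SAMPLE_XML = """
<dataset>
  <images>
    <image file='a.jpg'>
      <box top='10' left='20' width='80' height='80'/>
      <box top='5' left='150' width='30' height='30' ignore='1'/>
    </image>
    <image file='b.jpg'>
    </image>
  </images>
</dataset>
"""


def split_boxes(xml_text):
    """Return (truth, ignore): per-image lists of (left, top, width, height)."""
    truth, ignore = [], []
    for image in ET.fromstring(xml_text).iter("image"):
        image_truth, image_ignore = [], []
        for box in image.iter("box"):
            rect = tuple(int(box.get(key)) for key in ("left", "top", "width", "height"))
            # Assumed convention: ignore='1' marks a box the trainer should skip.
            if box.get("ignore") == "1":
                image_ignore.append(rect)
            else:
                image_truth.append(rect)
        truth.append(image_truth)
        ignore.append(image_ignore)
    return truth, ignore


truth_boxes, ignore_boxes = split_boxes(SAMPLE_XML)
```

Both results hold one list of boxes per image, in the same image order, which is the layout the three-argument `train` call above expects for its second and third arguments.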

Unknown said...

Thanks, Tommi, for your time. I just have two more questions:
1. What is the data type of ignore_boxes?
2. How do I test or reuse the .svm file generated by fhog_object_detector?
I know these are very silly questions, but please help me; I am not that used to C++.

thanks again

Davis King said...

You are asking basic programming questions. So the best advice I can give you is to learn C++ first, then try these computer vision applications after that. I would pick at least one of the books from here and read it: http://dlib.net/books.html

Tommi said...

Rahul: check the examples, it's all in there. I don't remember which one. I was trying to write the data types as well, but these comments interpret templates as HTML tags. So I think the dlib forums are a much better place to discuss these issues.

Anguo Yang said...

Dlib is only suitable for "logo"-like detection of objects with a specific shape, for example traffic signs.
For pedestrian or head counting applications, Dlib is not a smart choice, as it will show very bad accuracy :-)

It is also not suitable for HD video applications, e.g. 1080p video, which need intensive computation.
Since Dlib does not support the GPU (which OpenCV supports internally), if you want hardware acceleration you almost have to rewrite Dlib!

Davis King said...

There are plenty of applications of HOG (the thing dlib's detector uses) to pedestrian detection. So it should work fine for that. The same is true for head detection.

Unknown said...

Hi Davis

The areas which are not labeled as ground truth are taken as negative samples in Dlib. How are these negative samples extracted from the image? Does it randomly generate a fixed number of window locations and then use them as the negative samples? How many negatives does it create for one image?

Davis King said...

It trains on all windows in the image. There isn't any subsampling. The full details are explained in the paper referenced in the post.
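
To get a feel for how many windows "all windows" is, the following Python sketch counts window positions over an image pyramid. The 8 pixel stride and the 5/6 shrink per pyramid level are assumptions chosen for illustration, and `positions_1d` and `count_windows` are hypothetical helpers, so the totals are indicative only.

```python
def positions_1d(image_px, window_px, stride_px):
    """Number of window positions along one axis."""
    if image_px < window_px:
        return 0
    return (image_px - window_px) // stride_px + 1


def count_windows(image_w, image_h, win_w, win_h, stride=8, shrink=(5, 6)):
    """Count sliding-window positions over all pyramid levels."""
    if win_w <= 0 or win_h <= 0:
        raise ValueError("window size must be positive")
    num, den = shrink
    total = 0
    w, h = image_w, image_h
    while w >= win_w and h >= win_h:
        total += positions_1d(w, win_w, stride) * positions_1d(h, win_h, stride)
        # Move to the next, smaller pyramid level (integer arithmetic).
        w, h = w * num // den, h * num // den
    return total


# An 80x80 window over a 640x480 image yields thousands of candidate windows,
# nearly all of which act as negatives when only a few boxes are labeled.
total_windows = count_windows(640, 480, 80, 80)
```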

Unknown said...

Hi Davis,
Thanks for the wonderful library.

Is there a way to get whether the face is pure frontal or frontal-left or frontal-right?

Thanks.

Davis King said...

Thanks!

Yeah, if you call one of the methods of the object_detector that outputs a rect_detection you can look at the weight_index which tells you which of the 5 face poses generated the detection. I forget which face poses are which direction, but you can figure it out by looking at what it does.

Алексей said...

Hi Davis,
I'm trying to improve the detection of small faces. To avoid upsampling and sliding the detection window over a bigger image, I decided to retrain the detector on the LFW dataset with a smaller sliding window size. But when I decrease the sliding window size of the trainer, the average precision of the resulting detector also decreases and detections on test images become worse (some faces are missed). Is the 80 pixel value you used the smallest sliding window size that works well?
Do you expect that my approach (decreasing the sliding window size) can make the detector run faster on the same image while detecting smaller faces?

Thanks, Alexey.

Алексей said...

Hi Davis,

You commented previously "When using HOG, it's best to cluster the different poses together and train a detector for each pose. That's what I did for the face detector and it gives improved results." Could you please explain how to do this? For example, if I already have 5 sets of face images with 5 'train.xml' files, how can I create a single detector.svm file from them? I checked your default_face_detector with the 'num_detectors' method. It output 5. This means it was created from 5 datasets. How can I do the same?

Thanks for your library, it's really cool!
Alexey

Davis King said...

It's explained at the end of the example program: http://dlib.net/fhog_object_detector_ex.cpp.html

SurzirX said...

Hello Davis,

Firstly, thanks for the amazing contribution. The results from the example look promising.

In my project I am trying to perform traffic light detection (and later recognition). I am wondering if the HOG object detector works in real time?

Regards

Syed

Davis King said...

Thanks :)

HOG generally works well on non-rotating rigid objects. So it will probably work fine on traffic lights.

It runs pretty fast. How fast depends on your hardware and image sizes.

Santhosh B said...

Hi Davis,

Thanks for dlib. I am working on head detection and have to detect heads at all angles. I have a couple of questions.
1. If the size of the label is 64x64 and there are heads smaller than 64x64 in the test image, can they be detected using fhog_object_detector_ex.cpp?
2. How do I combine different detectors and pass them to HOG?

Thank you for the help

Davis King said...

Read the example program mentioned in the post (http://dlib.net/fhog_object_detector_ex.cpp.html). All your questions are discussed in the example.

mohanraj said...

Nice library. I am working on face recognition in video. I want to store the rotated and tiled face images in a vector. Can you help me with how to store aligned face images, so that I can extract the face features for face recognition?

Santhosh B said...

Thank you Davis..

Алексей said...

Hi Davis,

Thanks a lot for your really great job!
I'm new to SVMs and am going to implement a person identifier using your svm_c_trainer and feature vectors produced by Caffe. I was pretty impressed by your svm_c_ex example, which describes what to do very well. Thank you!

By the way I found how to train own face detector and obtain a single detector which contains a couple of internal detectors (for different facial poses) like your default_face_detector. Thanks for your tips!

vaibhav06891 said...

Hi Алексей ,

Can you share the knowledge of how to train our own face detector with me?
I am very curious to learn about it and would really appreciate if you could share some resources with me.

thank you!!

Алексей said...

Hi Vaibhav,

There is no ready-to-use functionality in dlib to train an object detector containing more than one internal detector. You must extend the existing class 'structural_object_detection_trainer' with an additional method of your own that calculates not a single feature vector but several (from several training sets), and then use another constructor of the 'object_detector' class which receives a vector of feature vectors instead of a single one.

Regards, Alexey

Davis King said...

You don't need to modify the dlib training code to do this. Just train multiple independent object_detectors. Then put them into a std::vector and use the object_detector constructor that builds an object_detector from a std::vector of other object_detectors.

Алексей said...

Yeah,
I hadn't noticed this constructor, Davis! It's my fault. But I had a good chance to learn a part of dlib's implementation. Thank you!

Unknown said...

Has anyone here used dlib to train a full body detector and then used the facial landmark concept to predict human pose (joint positions) in the body? I am keen on trying this.

Davis King, will this work? Is there anything I should keep in mind when trying this?

thanks

Davis King said...

How well it works depends on the difficulty of your data. HOG was initially designed for upright pedestrian detection. So it should work fine for things like that. But if you wanted to detect acrobats in a circus and predict their poses, it's definitely not going to work.

videodoes said...

thanks for this great library.

i have been able to connect it to openframeworks.cc toolchain and replicate your face training results.

couple of questions:
1.
I made my own labeling app that outputs the box XML file as per your face labeling example.
If I need ignore boxes, would this need to be a separate XML file? Is the internal XML structure/naming the same?
How do I load this new XML file into a std::vector<std::vector<rectangle> >, since the face_boxes_train are loaded via load_image_dataset?
I know eventually I have to add it as a 3rd argument to detector.train();

2.
i am trying to build a detector for overhead views on to people in museums, similar to this view: http://tinyurl.com/h7lmpj6
testing my code with faces returns good (not perfect) training and testing results.
but with the overhead people i get:
training results: 1 0.761364 0.761364
testing results: 1 0.137931 0.137931

Is it even possible to make a tracker for overhead people, since they will have different coloured clothing, moving legs, etc.?
i was hoping round head and shoulders would be a distinct marker but the legs and clothing mess it up?
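
For reading result lines like the ones quoted above: dlib's documentation for `test_object_detection_function` describes its three outputs as precision, recall, and average precision. The first two follow from simple counts, as in the Python sketch below. The counts are hypothetical, chosen only because they give figures of the same form as the testing line in this comment; `precision_recall` is an illustrative helper, not a dlib function.

```python
def precision_recall(true_positives, false_positives, missed):
    """Precision and recall from detection counts."""
    detected = true_positives + false_positives
    labeled = true_positives + missed
    # With nothing detected (or nothing labeled) the ratio is undefined;
    # 1.0 is used here as a convention.
    precision = true_positives / detected if detected else 1.0
    recall = true_positives / labeled if labeled else 1.0
    return precision, recall


# Hypothetical counts: 4 of 29 labeled people found, no false alarms.
precision, recall = precision_recall(true_positives=4, false_positives=0, missed=25)
```

A precision of 1 with a low recall means the detector almost never fires on background but misses most of the labeled objects.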

Unknown said...

Hello Davis,

I tried the code for detecting electric insulators. However, all the positive images are cropped images. I put these images' information in the XML file like this:

image file="cropImg\DSCF7982_0.jpg"
box height="33" width="119" left="0" top="0"
end image
image file="Neg\Negative-1.JPG"
end image

The first image is a positive image, and the full image is used as the positive area. The second image is a negative image which does not contain any positive sample, so no box is specified, and the full image is considered a negative sample.
However, the result is much poorer than when using uncropped images. I read these two links
http://stackoverflow.com/questions/30326560/object-detection-with-dlib-library
Following your suggestion, I added 16 pixels of padding to all positive samples. But that does not improve the performance. Can you suggest how I should use cropped images as positive samples?

Unknown said...

I have been using dlib for some time. Recently I have been having a problem loading a C++ trained .svm in Python. If I load the .svm using dlib.simple_object_detector, the file trained using C++ gives "Error deserializing object of type int". If I load it with dlib.fhog_object_detector, it loads, but when I try to use it for detection it gives no hits, while the Python-trained one gives hits on the same data. This problem has come up recently. Can you help?

Davis King said...

The python API just calls the C++ code. So there must be something you did differently in your C++ code. Like some kind of preprocessing, using a different color space, etc. that causes the difference.

Unknown said...

Hi Davis,
I am using an i7 with 8 GB RAM, the vc11 compiler, and a 3.60 GHz CPU, with AVX instructions enabled, to train a face detector on 250 images. It is taking around 4 hours to train. In the blog you mention it takes 3 minutes to train on 3000 images. Can you please suggest what the problem might be?
Thank You

Davis King said...

http://dlib.net/faq.html#Whyisdlibslow

Unknown said...

Thank You,
but I am already using release mode. Are there any other possible reasons for slow training?

Tommi said...

Are you using huge images? Have you enabled optimization from Visual studio options? The problem can also be some other form of VC++ being slow, instead of dlib being slow.

Unknown said...

Are you talking about SSE4 and AVX instructions, or are there some other optimizations specific to Visual Studio?

Unknown said...

HI,

I wonder if there is a Node.js package for this amazing face detection?

Unknown said...

I compared the face detection time of OpenCV and dlib on an Odroid XU4. Although dlib didn't give any false detections, unlike OpenCV, it takes around 0.3 seconds to do face detection with dlib compared to 0.07 seconds with OpenCV. I compiled dlib in release mode.

Anguo Yang said...

HI, vijayenthiran subramaniam,
I have the same question, the performance is really a big problem.

Davis King said...

Did you enable AVX instructions? Also, the speed is related to image size. If you give it a huge image it's going to be slow. Make the image smaller and it will run faster.
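
The size trade-off can be estimated with a little arithmetic. The Python sketch below assumes scan cost is roughly proportional to the number of pixels and reuses the 80 pixel window mentioned earlier in the thread; `downscale_tradeoff` is a hypothetical helper and the numbers are estimates, not measurements.

```python
def downscale_tradeoff(width, height, factor, window_px=80):
    """Effect of shrinking an image by `factor` (e.g. 0.5 halves each side)."""
    new_width, new_height = int(width * factor), int(height * factor)
    # Fraction of the original pixels left to scan (rough proxy for cost).
    pixel_ratio = (new_width * new_height) / (width * height)
    # A face must still span `window_px` after shrinking, so it has to be
    # larger in the original image.
    min_face_in_original = window_px / factor
    return new_width, new_height, pixel_ratio, min_face_in_original


# Halving a 1080p frame leaves a quarter of the pixels, but faces then
# need to be about 160 pixels wide in the original frame to be found.
tradeoff = downscale_tradeoff(1920, 1080, 0.5)
```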

Unknown said...

AVX instructions are only for speeding up the build time, right? Moreover, the CPU is a Samsung Exynos5422, and I am not sure it supports AVX instructions.
The image size is 320x240, which is half the size of 640x480. I even installed OpenBLAS on it.

Unknown said...

Hi Davis,

Thank you for your great work! I am trying to detect objects whose shapes might rotate freely (like vehicles in aerial images). When objects rotate in testing images, they are not successfully detected. Do you have any suggestions on that?

Davis King said...

This kind of object detector can't deal with large rotations. So you need to rotate the image many times and run the detector on each. Or train a detector for each rotation.

Unknown said...

This is really interesting - I was wondering what you use to create the black with white lines describing the detection network? The one the paper calls "R-HOG descriptor"?

Davis King said...

The function that draws the diagram is discussed in the example program.

Anonymous said...

Hi Davis!
Thank you for your great work! I started to create my own face detector using LFW as the base dataset. I separated the images from this dataset into groups (frontal, left side, right side, glasses, dark glasses) and annotated all of them. I trained each group separately, obtaining a recall over 0.97 for each group. After that I created a single trained file containing 5 HOG filters, as you explained on your blog, and obtained good results with test videos. But of course the time needed to process a frame increases with each HOG filter.
Following your indications, I tried to set a nuclear norm before passing the scanner to the trainer for each group of faces: "scanner.set_nuclear_norm_regularization_strength(9.0)". The problem is that in all cases the recall for each group is 0.0 and no face is detected.

Could you help me find what I am doing wrong and how to optimize the run time when using several HOG filters?

Regards,

Davis King said...

Try bigger C values. They need to be larger with a nuclear norm.

Unknown said...

Hello Davis,

I have been using dlib for about 6 months, and I want to do some work with the matrix returned by the function "draw_fhog", which is of type "matrix" I guess. Is it possible to convert it to an OpenCV Mat? The toMat() function does not accept it.

saludos

mohanraj said...

I am trying to compile dlib 19.0 using CMake.

The following errors occurred for me.

error C1083: Cannot open include file: 'initializer_list': No such file or directory
Kindly help me to solve the errors.

Thanks
Mohan

Unknown said...

Hi Davis, how can I speed up the face detector with Python 2? There is no SSE2, SSE4 or AVX on my machine and it takes about 2 seconds per picture.

Unknown said...

Hi Davis. I am using a machine with no SSE2, SSE4 or AVX support. I compiled the dlib face detector with Python and it seems to take about 2 seconds per image. How can I speed it up? Thank you so much.

Unknown said...

First, I would like to thank you for this great library and its amazing documentation.
I want to train the object detector using the LFW database. I wonder where I can find the labeled_faces_in_the_wild/frontal_faces.xml file?