Wednesday, April 9, 2014

Dlib 18.7 released: Make your own object detector in Python!

A while ago I boasted about how dlib's object detection tools are better than OpenCV's. However, one thing OpenCV had over dlib was a nice Python API. No longer! The new version of dlib is out and it includes a Python API for using and creating object detectors. What does this API look like? Well, let's start by imagining you want to detect faces in this image:


You would begin by importing dlib and scikit-image:
import dlib
from skimage import io
Then you load dlib's default face detector, the image of Obama, and then invoke the detector on the image:
detector = dlib.get_frontal_face_detector()
img = io.imread('obama.jpg')
faces = detector(img)
The result is an array of boxes called faces. Each box gives the pixel coordinates that bound each detected face. To get these coordinates out of faces you do something like:
for d in faces:
    print "left,top,right,bottom:", d.left(), d.top(), d.right(), d.bottom()
We can also view the results graphically by running:
win = dlib.image_window()
win.set_image(img)
win.add_overlay(faces)

But what if you wanted to create your own object detector?  That's easy too.  Dlib comes with an example program and a sample training dataset showing how to do this.  But to summarize, you do:
options = dlib.simple_object_detector_training_options()
options.C = 5  # Set the SVM C parameter to 5.  
dlib.train_simple_object_detector("training.xml", "detector.svm", options)
That will run the trainer and save the learned detector to a file called detector.svm. The training data is read from training.xml which contains a list of images and bounding boxes. The example that comes with dlib shows the format of the XML file. There is also a graphical tool included that lets you mark up images with a mouse and save these XML files. Finally, to load your custom detector you do:
detector = dlib.simple_object_detector("detector.svm")
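For reference, a training XML in this format looks roughly like the following (a hand-typed sketch; the file name and box numbers are made up):

```xml
<?xml version='1.0' encoding='ISO-8859-1'?>
<dataset>
  <images>
    <image file='images/face_001.jpg'>
      <box top='74' left='231' width='156' height='154'/>
    </image>
  </images>
</dataset>
```

Each box records the top-left corner plus width and height, in pixels, of one labeled object.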
If you want to try it out yourself you can download the new dlib release here.

60 comments:

Manuel said...

This library looks amazing!

How can I go and install it? I cannot seem to find a good tutorial.

Davis King said...

The comment at the top of each python example tells you what to do to compile the library.

Tester said...

Hello and thanks for your work.

I tried to train a detector with the .py file you provided. It works well on about 10 images (each about 2000x2000, jpg), but it fails with "Memory Error" on more than 10 images.
Sorry if the solution to this problem is obvious.

OS: Windows 7 64bit (using 32bit Python 2.7)

Tester said...

Oh, I guess I forgot to ask an actual question: do you know why exactly this error occurs, and how can I prevent it while still training on more images? My goal is to train on some hundreds of images, each of the same size.

manas said...

I used the imglab exe to make the file with the boxes. While running the code to build the svm file, it sometimes fails partway through, so I checked: when I changed the width and height to random values it worked, but that will increase the chance of misclassification. How are the bounding boxes affecting the training process?

Davis King said...

What happens when it fails? Is there an error message?

manas said...

Hi Davis,

There's absolutely no error message. The last checkpoint is when it counts the number of images, and then it crashes.

manas said...

So is there a certain aspect ratio to be maintained while drawing the bounding box over the object? On certain occasions the default window size of 80 x 80 does not seem to work unless changed to 50 x 50. What features should be common: similar height, width, aspect ratio, area, etc.?

Davis King said...

There is no error message at all? What happens? The program terminates and nothing is output to disk or the screen?

You should try to make all your boxes have a similar aspect ratio.

manas said...

There is absolutely no message on the screen; it just crashes. I think most of the boxes are made to maintain the aspect ratio. I can share the xml with you if you wish to analyse it.

Davis King said...

Sure, if you can post a complete program that demonstrates the error you are seeing that would be great.

Unknown said...

How do I save the image to a file? I don't have a GUI.
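One way to do it without a GUI is to paint the boxes straight into the image array and write the result out with scikit-image's io.imsave, which the post already imports. A sketch (draw_box is my own helper, not a dlib function):

```python
import numpy as np

def draw_box(img, left, top, right, bottom, color=(255, 0, 0), thickness=2):
    """Paint a rectangle outline into an RGB image array, in place."""
    h, w = img.shape[:2]
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, w - 1), min(bottom, h - 1)
    img[top:top + thickness, left:right + 1] = color                # top edge
    img[bottom - thickness + 1:bottom + 1, left:right + 1] = color  # bottom edge
    img[top:bottom + 1, left:left + thickness] = color              # left edge
    img[top:bottom + 1, right - thickness + 1:right + 1] = color    # right edge
```

After detecting, you would call draw_box(img, d.left(), d.top(), d.right(), d.bottom()) for each d in faces, then io.imsave('out.jpg', img).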

manas said...

I am using evaluate_detectors(). How do I know which detector returned each rectangle?

Thanks

Davis King said...

The documentation for evaluate_detectors() tells you how: http://dlib.net/dlib/image_processing/scan_fhog_pyramid_abstract.h.html#evaluate_detectors

manas said...

Managed to complete the entire thing; the only thing stopping me is this:

I have wrapped the entire training in a function. The training runs fine and generates the detector, but it crashes at the function exit point. Any idea about that?

I've tried everything. I have a hunch that a thread (dlib::Svm_thread) is not getting released. Could that be the issue? If so, how do I make the function wait for the thread to finish?

Davis King said...

Do the example programs run without crashing if you don't modify them? If yes then there is probably a bug in your code, not in dlib.

Unknown said...

I'm using Dlib to redact people's heads from body camera footage to post at https://www.youtube.com/channel/UCcdSPRNt1HmzkTL9aSDfKuA. Should I be making different svm files for the various head positions? How many different videos do I need to train on in order to create a very reliable head detector?

Davis King said...

If you have heads it isn't detecting then yes, you need to train more models for those head poses. A few hundred examples is usually sufficient for the training to give quite good results.
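If you end up with several single-pose models, one simple way to combine them while keeping track of which model fired is a small wrapper like this (a generic sketch; the detector names and .svm file names are hypothetical):

```python
def run_detectors(detectors, img):
    """Run each named detector over the image and tag every hit with the
    name of the detector that produced it."""
    hits = []
    for name, det in detectors.items():
        for box in det(img):
            hits.append((name, box))
    return hits

# With dlib this would look roughly like:
# detectors = {"frontal": dlib.simple_object_detector("frontal.svm"),
#              "profile": dlib.simple_object_detector("profile.svm")}
# hits = run_detectors(detectors, img)
```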

mdfwn said...

Hello Mr. King,
I have two questions:

i) Do I have independent control over the width and height of the detection_window_size? I could not set a tuple for this option, and I need the detection area to be a non-square rectangle.

ii) Do I have control over the pyramid size? For a current project, I don't need/want to apply the algorithm at different scales.

I tried experimenting and reading the docs, so I suspect the answer is 'no' to both questions. Since these options are available in the C++ implementation: would it be much work to re-compile the c++ code to get a new dlib.pyd file which uses the needed options?

Thanks for your time.

Davis King said...

The python interface picks the best aspect ratio for the detection window based on your training data. So if most of your training boxes are two times as tall as wide then the detection window will be like that too.

If you want more control then you need to use the C++ API rather than trying to modify the python API as that is a lot of work. I mean, you can, but if you have enough ability to modify the underlying C++->Python API implementation then you can just work in C++ in a fraction of the time.

Tim S said...

Hello. Looks cool. I tried following along but am too dumb to install dlib so that python import works.

I followed the usual install instructions as far as

cmake --build . --config Release

which seemed to work, but Python remains unaware. Any ideas, or is there an idiot's guide to doing this?

Ta

Tim S said...

Oops - just saw the comment to read the python examples - I'll try that

mdfwn said...

Hello Mr. King,
can you elaborate on the .svm file that is produced (and re-used) by the object detector/trainer?

i) What information is stored in this file?
ii) How can I read and modify it?
iii) Is this exact .svm file compatible with the pure C++ implementation and therefore also usable by this (e.g. if I want to train in python but someday decide to switch to C++)?
iv) Am I dependent on dlib or can I somehow access the svm parameters which are stored in that file (and therefore use it with another SVM module)?

Thanks for your time.

Davis King said...

The python code is just a wrapper around dlib's C++ code. So you can load and use the object detectors without issue in C++.

The file isn't somehow encrypted, so you can read the values out of it and do whatever you want if you were motivated to write your own processing code. It is however highly technical, but all the details are documented in the main C++ side of dlib and in this paper: http://arxiv.org/abs/1502.00046

Unknown said...

How do I go about debugging the script being killed very early on in training?

Davis King said...

Is there an error message?

Unknown said...

No there isn't. The output is
Training with C: 5
Training with epsilon: 0.01
Training using 2 threads.
Training with sliding window 79 pixels wide by 81 pixels tall.
Training on both left and right flipped versions of images.
Upsample images...
Upsample images...
Killed

Davis King said...

Maybe it ran out of RAM. How many images did you give it? Are you compiling a 32bit executable and therefore only able to use 2GB of RAM?

Unknown said...

I gave it 64 images. I have 3GB and am running Linux. On Linux I don't have to force it to compile a 64bit executable right?

Davis King said...

If you are using 64bit linux then everything is just always in 64bits so there isn't anything you need to do to use all the RAM.

I don't know what's happening. Does the trainer work when you run it without modification on the training data that comes with dlib?

Unknown said...

I upgraded to 4 cores and 7GB of RAM on Azure and now get the error below. Normally when this happens it tells me the filename. It does work with the examples and works off and on with my own images.

I'm having trouble figuring out which image this is.

image index 83
match_eps: 0.5
best possible match: 0.488811
truth rect: [(724, 6) (968, 150)]
truth rect width/height: 1.68966
truth rect area: 35525
nearest detection template rect: [(773, -23) (923, 143)]
nearest detection template rect width/height: 0.904192
nearest detection template rect area: 25217

Davis King said...

It is the image at index 83 in the list of images you gave to the training code.

Unknown said...

Does the indexing start at 0 or 1? I'm going to write a script to tell me the filename of a video by index.

Davis King said...

It starts at 0

Unknown said...
This comment has been removed by the author.
Unknown said...

Can I use the dlib training method to detect people by feeding it pictures of people's body shapes etc., and track them using a Raspberry Pi? Would it have enough power to do tracking with dlib in real time?

mdfwn said...

Tracking is a little bit more than just detection. You might want to use dlib's Real Time Video Object Tracking: http://blog.dlib.net/2015/02/dlib-1813-released.html

Unknown said...

How do I put text on top of the detection rectangle?

Unknown said...

Hi Davis,
how do I reuse the .svm file generated by the HOG object detector? I am using Visual Studio 12 as my compiler.

Unknown said...

Thank you Davis.
I have created my own detector.
Its results are:

Trained with C: 5
Training accuracy: precision: 0.991111, recall: 0.771626, average precision: 0.769863
Testing accuracy: precision: 0.986111, recall: 0.731959, average precision: 0.723225

Trained with C: 10
Training accuracy: precision: 0.991701, recall: 0.82699, average precision: 0.82468
Testing accuracy: precision: 0.975309, recall: 0.814433, average precision: 0.804037

Trained with C: 20
Training accuracy: precision: 0.992248, recall: 0.885813, average precision: 0.883479
Testing accuracy: precision: 0.976744, recall: 0.865979, average precision: 0.854488

Trained with C: 25
Training accuracy: precision: 0.996169, recall: 0.899654, average precision: 0.897599
Testing accuracy: precision: 0.977011, recall: 0.876289, average precision: 0.864523

Trained with C: 30
Training accuracy: precision: 0.996212, recall: 0.910035, average precision: 0.908016
Testing accuracy: precision: 0.967033, recall: 0.907216, average precision: 0.894226

Trained with C: 40
Training accuracy: precision: 0.996255, recall: 0.920415, average precision: 0.918458
Testing accuracy: precision: 0.967033, recall: 0.907216, average precision: 0.895631

Trained with C: 50
Training accuracy: precision: 0.99631, recall: 0.934256, average precision: 0.932443
Testing accuracy: precision: 0.967391, recall: 0.917526, average precision: 0.904212

Trained with C: 100
Training accuracy: precision: 0.996377, recall: 0.951557, average precision: 0.949977
Testing accuracy: precision: 0.9375, recall: 0.927835, average precision: 0.913309

I think C: 30 is the best.
What do you think about this?

Davis King said...

Which is best really depends on your application, and in particular, how much you care about different types of errors.
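One way to make that tradeoff concrete is to fold each precision/recall pair into a single F-beta score, where beta > 1 favors recall and beta < 1 favors precision (a plain-Python sketch, not something dlib prints):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: the weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A few of the testing accuracies reported above, keyed by C.
for C, (p, r) in [(5, (0.986111, 0.731959)),
                  (30, (0.967033, 0.907216)),
                  (100, (0.9375, 0.927835))]:
    print("C=%d  F1=%.3f  F2=%.3f" % (C, f_beta(p, r), f_beta(p, r, 2.0)))
```

If missed detections cost more than false alarms, weight recall up with a larger beta and pick the C that maximizes that score instead.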

Unknown said...
This comment has been removed by the author.
Unknown said...

Tuning is difficult for me
Thank you Davis

Unknown said...

I am using the Python example to train on a custom object (road signs), but the detection draws a bigger, arbitrary box around the detection area. I thought it would be accurate and draw the box exactly around the matching object, so obviously something has gone wrong with the training. Has anyone else experienced this before? I resized all boxes to 80x80 and set my detection window size to 6400.

Anonymous said...

Hi, if I use more training data to train an object detector, will the detection time be longer than before? For example, say I use 50 people to train detector V1 and 100 people to train detector V2, then use both V1 and V2 to detect faces. Is the detection time the same? Many thanks.

Davis King said...

The detection time is always the same.

Unknown said...

How about training time? How different would that be?

Davis King said...

As you would expect, more training data makes training take longer.

Unknown said...

Hi Davis;

Is it possible to use images that don't contain any target object (so no box in the xml for these images) in training?

Davis King said...

Not all images in the training data need labels. Any part of any image that doesn't have a box on it is treated as negative data and the algorithm will learn to not put boxes there.
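In the XML, an all-negative image is just listed with no child boxes; a sketch (file names made up):

```xml
<images>
  <image file='images/street_with_faces.jpg'>
    <box top='74' left='231' width='156' height='154'/>
  </image>
  <!-- no boxes: the whole image is treated as negative data -->
  <image file='images/empty_street.jpg'/>
</images>
```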

Unknown said...

How do I convert an SVM file to the DAT file extension?

Davis King said...

Nothing in dlib cares about the file extension.

johnpuskin99 said...

Hi Davis,
get_frontal_face_detector is based on HOG features and a linear classifier (SVM). You call get_frontal_face_detector in face detection programs without deserializing the previous SVM training results. I wonder how get_frontal_face_detector works without training data.

Unknown said...
This comment has been removed by the author.
Unknown said...

Hi Davis

I am working on a blind spot detection problem for a vehicle, so I want to detect cars, motorbikes, pedestrians, or any vulnerable vehicles while changing lanes. I thought of detecting vehicles using dlib, and we wanted to try the HOG + SVM detector. I tried detection window sizes of 80x80, 60x60, 40x40, etc., and also changed the pyramid parameter from 6 to 12, but it always fails with the exception below:

An impossible set of object labels was detected. This is happening because none
of the object locations checked by the supplied image scanner is a close enough
match to one of the truth boxes in your training dataset. To resolve this you
need to either lower the match_eps, adjust the settings of the image scanner so
that it is capable of hitting this truth box, or adjust the offending truth
rectangle so it can be matched by the current image scanner. Also, if you are
using the scan_fhog_pyramid object then you could try using a finer image
pyramid. Additionally, the scan_fhog_pyramid scans a fixed aspect ratio box
across the image when it searches for objects. So if you are getting this error
and you are using the scan_fhog_pyramid, it's very likely the problem is that
your training dataset contains truth rectangles of widely varying aspect
ratios. The solution is to make sure your training boxes all have about the
same aspect ratio.

image index 2
match_eps: 0.5
best possible match: 0.457987
truth rect: [(561, 484) (621, 566)]
truth rect width/height: 0.73494
truth rect area: 5063
nearest detection template rect: [(572, 492) (652, 572)]
nearest detection template rect width/height: 1
nearest detection template rect area: 6561

Would you be kind enough to tell me what match_eps stands for? I could only work out that it's the 3rd image in order, but what do the other parameters represent? Could you please suggest how to approach this problem? We would prefer to do it on the CPU, if possible.

Jay said...

Hi Davis,
In my training data, I have six classes, which include a background class (negative class). I would like to know whether it is possible to obtain multiclass SVM probabilities in dlib. I want the SVM to output not only the class labels but also their confidence values. Please help me in this regard.


Davis King said...

The output includes the SVM confidence values. Consult the documentation to see how to get it.

Jay said...

Many thanks for the reply.

Unknown said...

Hello Davis,
First, thank you so much for your work it is really helpful in many ways.

I am trying to retrain the face detector on some thermal images. To do so, I am using the Python example train_object_detector.py, and I am having some issues with the dlib.train_simple_object_detector() function.

My first goal was to train it on 5000 images of dimension 160x120 pixels.

But I have been having some RAM issues. I tried to resize the images, but then the bounding boxes were too small ("smaller than about 400 pixels in area").
So I found that 500 images were the maximum I could use.
So now I am training on those 500 images, and I always get:

Training accuracy: precision: 1, recall: 0, average precision: 0
Testing accuracy: precision: 1, recall: 0, average precision: 0

Do you have any idea of what could be wrong in what I am doing?

Thanks a lot

Davis King said...

Your labels are most likely inaccurate or inconsistent in some way. Train on a smaller dataset that you are sure is labeled the way you really want. Get it working on that, then run that resulting model on the other images and see where it disagrees with labels. Or add more images but review them to make sure the boxes are in the right places.