Monday, February 3, 2014

Dlib 18.6 released: Make your own object detector!

I just posted the next version of dlib, v18.6.  There are a bunch of nice changes, but the most exciting addition is a tool for creating histogram-of-oriented-gradient (HOG) based object detectors.  This is a technique for detecting semi-rigid objects in images which has become a classic computer vision method since its publication in 2005.  In fact, the original HOG paper has been cited over 7000 times, which, for those of you who don't follow the academic literature, is a whole lot.
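
If you haven't run into HOG before, the core idea fits in a few lines.  The sketch below is only an illustration and not dlib's code: it builds the magnitude-weighted histogram of gradient orientations for a single cell of a grayscale image.  The function name, the 9 bins, and the use of plain central differences are assumptions made for the example, and real HOG implementations add further steps such as normalization across neighboring cells.

```python
import math


def cell_histogram(img, bins=9):
    """Histogram of unsigned gradient orientations for one cell.

    img is a 2D list of grayscale values.  Border pixels are skipped so
    that central differences are always defined.  (Illustrative only:
    names and bin count are assumptions, not dlib's implementation.)
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            gx = img[y][x + 1] - img[y][x - 1]  # horizontal gradient
            gy = img[y + 1][x] - img[y - 1][x]  # vertical gradient
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180): opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# A vertical edge: dark on the left, bright on the right.
edge = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
print(cell_histogram(edge))
```

All of the gradient energy from the vertical edge lands in a single orientation bin, which is the kind of signal a HOG detector learns to weight.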

But back to dlib: the new release has a tool that makes training HOG detectors super fast and easy.  For instance, here is an example program that shows how to train a human face detector.  All it needs as input is a set of images and bounding boxes around faces.  On my computer it takes about 6 seconds to do its training using the example face data provided with dlib.  Once finished it produces a HOG detector capable of detecting faces.  An example of the detector's output on a new image (i.e. one it wasn't trained on) is shown below:

You should compare this to the time it takes to train OpenCV's popular cascaded Haar object detector, which is generally reported to take hours or days to train and requires you to fiddle with false negative rates and all kinds of spurious parameters.  HOG training is considerably simpler.
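
To make the "images plus bounding boxes" input concrete, here is a small sketch that writes a set of labeled boxes to an XML file.  The layout is modeled on the XML dataset files that ship with dlib's example data, but treat the element and attribute names as an assumption and check them against the files in your copy of dlib.  The file names and coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_dataset_xml(annotations):
    """Serialize {image filename: [(left, top, width, height), ...]} to XML.

    The element and attribute names are assumed to follow the layout of
    dlib's example datasets; verify against the files shipped with dlib.
    """
    dataset = ET.Element("dataset")
    images = ET.SubElement(dataset, "images")
    for filename, boxes in annotations.items():
        image = ET.SubElement(images, "image", file=filename)
        for left, top, width, height in boxes:
            ET.SubElement(
                image,
                "box",
                top=str(top),
                left=str(left),
                width=str(width),
                height=str(height),
            )
    return ET.tostring(dataset, encoding="unicode")


# Hypothetical labels: two images containing three faces in total.
labels = {
    "face_01.jpg": [(10, 20, 80, 80)],
    "face_02.jpg": [(5, 5, 40, 40), (100, 50, 60, 60)],
}
print(build_dataset_xml(labels))
```

In practice you would draw the boxes with a labeling tool rather than type them by hand, but the end result is the same kind of file: one entry per image, one box per object.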

Moreover, the HOG trainer uses dlib's structural SVM based training algorithm which enables it to train on all the sub-windows in every image.  This means you don't have to perform any tedious subsampling or "hard negative mining".  It also means you often don't need that much training data.  In particular, the example program that trains a face detector takes in only 4 images, containing a total of 18 faces.  That is sufficient to produce the HOG detector used above.  The example also shows you how to visualize the learned HOG detector, which in this case looks like:

It looks like a face!  It should be noted that it's worth training on more than 4 images since it doesn't take that long to label and train on at least a few hundred objects and it can improve the accuracy.  In particular, I trained a HOG face detector using about 3000 images from the Labeled Faces in the Wild dataset and the training took only about 3 minutes.  3000 is probably excessive, but who cares when training is so fast?
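
To get a feel for what "training on all the sub-windows" means, it helps to count them.  The sketch below is a back-of-the-envelope calculation rather than dlib's scanning code: it slides a fixed-size window over an image and over successively smaller copies of it.  The 80x80 window, the 8 pixel stride, and the 5/6 downsampling factor are illustrative assumptions.

```python
def count_windows(img_w, img_h, win=80, stride=8, num=5, den=6):
    """Count sliding-window positions over a simple image pyramid.

    Each pyramid level shrinks the image by num/den until the window no
    longer fits.  All parameter values here are illustrative assumptions.
    """
    total = 0
    w, h = img_w, img_h
    while w >= win and h >= win:
        cols = (w - win) // stride + 1  # horizontal window positions
        rows = (h - win) // stride + 1  # vertical window positions
        total += cols * rows
        w, h = w * num // den, h * num // den  # next, smaller level
    return total


print(count_windows(640, 480))
```

Even a single 640x480 image yields thousands of candidate windows under these assumptions, and only a handful of them contain a labeled object.  Every other window acts as a negative example, which is why a trainer that considers all of them has no need for a separate hard negative mining step.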

The face detector which was trained on the Labeled Faces in the Wild data comes with the new version of dlib.  You can see how to use it in this face detection example program.  The underlying detection code in dlib will make use of SSE instructions on Intel CPUs and this makes dlib's HOG detectors run at the same speed as OpenCV's fast cascaded object detectors.  So for something like a 640x480 resolution web camera it's fast enough to run in real-time.  As for the accuracy, it's easy to get the same detection rate as OpenCV but with thousands of times fewer false alarms.  You can see an example in this YouTube video which compares OpenCV's face detector to the new HOG face detector in dlib.  The circles are from OpenCV's default face detector and the red squares are dlib's HOG based face detector.  The difference is night and day.
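
If you want to run this kind of comparison on your own data, you need a way to count detections and false alarms.  The sketch below shows one common approach, assuming a detection counts as a hit when its intersection-over-union with an unmatched ground truth box is at least 0.5.  The function names and the threshold are assumptions for the example, not the procedure used to produce the numbers above.

```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_detections(detections, truth, min_iou=0.5):
    """Return (detection rate, number of false alarms) for one image.

    Each ground truth box can be matched by at most one detection.
    The 0.5 threshold is an assumed, commonly used default.
    """
    unmatched = list(truth)
    hits = 0
    false_alarms = 0
    for det in detections:
        best = max(unmatched, key=lambda t: iou(det, t), default=None)
        if best is not None and iou(det, best) >= min_iou:
            unmatched.remove(best)
            hits += 1
        else:
            false_alarms += 1
    rate = hits / len(truth) if truth else 0.0
    return rate, false_alarms


truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(score_detections(detections, truth))
```

Running both detectors over the same labeled images and comparing the two numbers gives a like-for-like view of detection rate versus false alarms.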

Finally, here is another fun example.  Before making this post I downloaded 8 images of stop signs from Google Images, drew bounding boxes on them, and then trained a HOG detector.  This is the detector I got after a few seconds of training:

It looks like a stop sign, and testing it on a new image works great.

Altogether it took me about 5 minutes to go from not having any data at all to a working stop sign detector.  Not too shabby.  Go try it out yourself.  You can get the new dlib release here :)