Thursday, August 28, 2014

Real-Time Face Pose Estimation

I just posted the next version of dlib, v18.10, and it includes a number of new minor features.  The main addition in this release is an implementation of an excellent paper from this year's Computer Vision and Pattern Recognition Conference:
One Millisecond Face Alignment with an Ensemble of Regression Trees by Vahid Kazemi and Josephine Sullivan
As the name suggests, it allows you to perform face pose estimation very quickly. In particular, this means that if you give it an image of someone's face it will add this kind of annotation:

In fact, this is the output of dlib's new face landmarking example program on one of the images from the HELEN dataset.  To get an even better idea of how well this pose estimator works take a look at this video where it has been applied to each frame:


It doesn't just stop there though.  You can use this technique to make your own custom pose estimation models.  To see how, take a look at the example program for training these pose estimation models.

91 comments:

Hamilton 漢密頓 said...

well done

Rodrigo Benenson said...

Have you evaluated this implementation quality-wise and/or speed-wise? How does it compare to the numbers reported in the original research paper?

Davis King said...

Yes. The results are comparable to those reported in the paper both in terms of speed and accuracy.

Rodrigo Benenson said...

Sweet!

Stephen Moore said...

Does the "real time pose estimation algorithm" use a face detector on every frame, or does it use the previous frame's output for the current frame's estimation?

Davis King said...

You can run it either way. The input to the pose estimator is a bounding box for a face and it outputs the pose.

The included example program shows how to get that bounding box from dlib's face detector but you could just as easily use the face pose from the previous frame to define the bounding box.

Amanda Sgroi said...

In the paper, "One Millisecond Face Alignment ...", they output 194 landmark points on the face; however, the implementation provided in dlib only outputs 68 points. Is there a way to easily produce the 194 points using the code provided in dlib?

Davis King said...

I only included the 68 point style model used by the iBUG 300-W dataset in this dlib release. However, if you want to train a 194 point model you can do so pretty easily by following the example here: http://dlib.net/train_shape_predictor_ex.cpp.html

You can get the training data from the HELEN dataset webpage http://www.ifp.illinois.edu/~vuongle2/helen/.

drjo said...

I compiled the example from v18.10 and get an error: "DLIB_JPEG_SUPPORT not #defined: Unable to load the image in file ..\faces\2007_007763.jpg".

Can you please help me out?

Davis King said...

You need to tell your compiler to add a #define for DLIB_JPEG_SUPPORT and then link it with libjpeg.

If you are unsure how to configure your compiler to do this then I would suggest using CMake (following the directions http://dlib.net/compile.html). CMake will set all this stuff up for you.

Xan63 said...

Hi, thanks for dlib!
I also have an issue with jpeg (Win7, Visual Studio and CMake) when compiling dlib:
error C2371: 'INT32' : redefinition; different basic types, in jmorecfg.h

It compiles (and works) just fine without jpeg support.

Xan63 said...

Answering myself: if I leave JPEG_LIBRARY and JPEG_INCLUDE_DIR empty in my CMake-gui, then dlib is still compiled with JPEG support, despite CMake telling me: Could NOT find JPEG (missing: JPEG_LIBRARY JPEG_INCLUDE_DIR).
Not sure what is going on, but it works...

Davis King said...

CMake will try to find a version of libjpeg that is installed on your system and use that. If it can't find a system version of libjpeg it prints out that it didn't find it. I then have CMake set up to statically compile the copy in the dlib/external/libjpeg folder when a system install of libjpeg is not found. So that's why you get that message.

More importantly, I want to make sure dlib always compiles cleanly with CMake. So can you post the exact commands you typed to get the C2371: 'INT32' : redefinition; different basic types, in jmorecfg.h error?

I don't get this on any of the systems I have. The string INT32 doesn't even appear in any code in the dlib folder so I'm not sure how this happened.

Xan63 said...

That explains a lot...
As for the commands, I use CMake-gui, so I just throw the CMakeLists.txt in there and everything works fine, except for that error message about JPEG (Could NOT find JPEG (missing: JPEG_LIBRARY JPEG_INCLUDE_DIR)).

If I try to fix it (now I understand that I don't need to) and fill in JPEG_INCLUDE_DIR and JPEG_LIBRARY in CMake-gui, for example using the libjpeg that comes with OpenCV, then I get this C2371: 'INT32' error when compiling (with Visual Studio 2012).

Davis King said...

Ok, that makes sense. I'll add a print statement to the CMakeLists.txt file so it's clearer what is happening in this case :)

Ked Su said...
This comment has been removed by the author.
Davis King said...

That Google Drive link doesn't work for me. Can you post the image another way? Also, is the image extremely large? That's the only way I would expect an out of memory error.

Ked Su said...
This comment has been removed by the author.
Davis King said...

Huh, I don't know what's wrong. That's not a large enough image to cause an out of memory error. I also tried it on my computer and it works fine.

What system and compiler are you using? Also, what is the exact error message you get when you run the image through the face_landmark_detection_ex example program that comes with dlib?

Ked Su said...
This comment has been removed by the author.
Davis King said...

Cool. No worries :)

Cheers,
Davis

mohanraj said...

I am facing a problem while trying to run the face detection program in Visual Studio 2012:
DLIB_JPEG_SUPPORT not #defined
How do I fix this?

Davis King said...

Try compiling it with CMake. The instructions are shown here: http://dlib.net/compile.html

mohanraj said...

I compiled the examples folder in dlib with CMake. How do I test the program now?

Davis King said...

Then you run the face_landmark_detection_ex executable.

Shengyin Wu said...

Can you tell me the parameters you used when training on the iBUG dataset?

Davis King said...

If I recall correctly, when training on iBUG I used the default dlib parameter settings except I set the cascade depth to 15 instead of 10.

Jess said...

I am wondering if you can help me with a speed issue I am having.

I am trying to set up a test using my laptop's webcam (OpenCV) to add the face pose overlay in real time, using the example code provided.

The face detector and full_object_detection functions seem to be taking multiple seconds per frame to compute (480x640).

I have compiled dlib using cmake on visual studio 2013 with the 64 bit and avx flags.

I was wondering if you could point me in the right direction to reach this one millisecond number the paper boasts.

Davis King said...

Did you compile in release or debug mode?

Jess said...

Ah, yes that was the problem. I had assumed setting cmake to release would default the library build to release as I only changed the example code build settings in VS.

Thanks!

Shengyin Wu said...

When training on the iBUG dataset, did you generate the bounding boxes yourself, or just use the bounding boxes supplied by the 300 Faces in the Wild competition?

Davis King said...

I generated the bounding boxes using dlib's included face detector. This way, the resulting model is calibrated to work well with dlib's face detector.

Shengyin Wu said...

If the detector failed to detect the face, how did you generate the bounding box? Thanks for your reply.

Davis King said...

In that case I generated it based on the landmark positions. However, I made sure the box was sized and positioned in the same way the dlib detector would have output if it had detected it (e.g. centered on the nose and at a certain scale relative to the whole face).

Emre YAZICI said...

Hello, great work and works very fast.

Thanks.

Is there any method to estimate Yaw, Pitch, Roll with these estimated landmarks?

Emre YAZICI said...
This comment has been removed by the author.
Davis King said...

Thanks.

The output is just the landmarks.

Emre YAZICI said...

Hello,

When I try to train shape predictor with more than 68 landmarks, it fails some assertions "DLIB_CASSERT(det.num_parts() == 68" in lbp.h, render_face_detections.h so on.

How can I train with more landmarks?

Thank you

Davis King said...

Don't call render_face_detections()

Emre YAZICI said...

Thank you. It really works.

One last question for better training.

Do I need to specify the box rectangle [box top='59' left='73' width='93' height='97'] correctly ?

Or can I leave it like 0,0,width,height?

If I need to specify box, do I need to use dlib face detector to locate faces?

Thanks again for this great work

Davis King said...

You have to give a reasonable bounding box, but you can get the box any way you like. However, when you use this thing in a real application you will pair it with some object detector that outputs bounding boxes for your objects prior to pose estimation. So it's a very good idea to use that same object detector to generate your bounding boxes for pose estimation training.

Olivier KIHL said...

Have you only use the ibug dataset (135 images) to train the model shape_predictor_68_face_landmarks ?

Davis King said...

That model is trained on the iBUG 300-W dataset which has several thousand images in it.

Emory Xu said...

Hi, Davis. I get the input from a camera, so the face landmarks are displayed in a real-time video. But the processing time between frames is quite long. How can I make it faster?

Davis King said...

Did you compile with optimizations and SSE/AVX enabled?

Emory Xu said...

Yes, I have compiled with optimizations and SSE/AVX enabled, but the speed is still slow. It takes about 2 seconds to landmark one frame before processing the next...

Emory Xu said...

I compile it with CMake and run the code in Visual Studio 2013 with the Win32 platform.

Davis King said...

How big are the images you're giving it?

Emory Xu said...

The input is a real time video loaded from the camera. I load the camera by using opencv functions and set the size of camera as following:
cap.set(CV_CAP_PROP_FRAME_WIDTH, 80);
cap.set(CV_CAP_PROP_FRAME_HEIGHT, 45);

Davis King said...

Then it should be very fast. You must be either timing the wrong thing or you haven't actually compiled with optimizations. Is the executable you are running output to a folder called Debug or Release? How are you timing it?

Emory Xu said...

I run the output from a folder called Debug. Oh, I find that the face landmarking is not slow; rather, the face detection step is slow... How can I make it faster? Thx.

Davis King said...

If the executable is in the Debug folder then you haven't turned on optimizations. Visual studio outputs the optimized executable to a folder called Release.

When you open visual studio and compile the executable you have to select Release and not Debug or it will be very slow.

Emory Xu said...
This comment has been removed by the author.
Emory Xu said...

Wow! Davis, thanks so much! It seems that with the Debug folder I hadn't turned on optimizations. When I switch to Release mode, the speed is fast enough!!! Thx!! :P

Davis King said...

No problem :)

Emory Xu said...

Hi, Davis. I have another question for this example. In this algorithm, the face can be normalized and shown in a small window by the following code:

extract_image_chips(cimg, get_face_chip_details(shapes), face_chips);
win_faces.set_image(tile_images(face_chips));

How can I get the coordinates of the face landmarks after the normalization, i.e. the coordinates relative to the small face window?

Thank you so much.

Davis King said...

Take the output of get_face_chip_details() and give it to get_mapping_to_chip(). That will return an object that maps from the original image into the image chips which you can use to map the landmarks to the chip.

Emory Xu said...

Should I write the function call as follows?

get_mapping_to_chip(get_face_chip_details(shapes));

And how do I capture the returned object?
Thx

Davis King said...

No. Look at the signatures for those functions and you will see how to call them. The object is returned by assigning it into a variable with the = operator.

It would be a good idea to get a book on C++ and its syntax. I have a list of suggestions here: http://dlib.net/books.html

Emory Xu said...

Dear Davis, I have tried my best to understand how to call these functions, but I still don't know how to do the transformation. The code I wrote is the following:

std::vector<chip_details> v = get_face_chip_details(shapes);

for (int i = 0; i < v.size(); i++){
point_transform_affine p = get_mapping_to_chip(v[i]);

}

My question: how do I use the point_transform_affine p to map a point? Thanks for your help.

Davis King said...

You map the points in each shape to the chip with a statement like this:

point point_in_chip = p(v[i].part(0));

Emory Xu said...

I wrote as this:

for (int i = 0; i < v.size(); i++){ point_transform_affine p = get_mapping_to_chip(v[i]);
point point_in_chip = p(v[i].part(0));
}

But v[i] is a chip_details, which has no member part()... I cannot get the points.

Davis King said...

Oops, I meant p(shapes[i].part(0))

Emory Xu said...

Yes!!!
It really solved my problem!!!
Yes!!!
It really solved my problem!!!
Thank you so much, Davis, you are so kind and patient!!

Davis King said...

No problem :)

Karla Trejo said...

Hi Davis, this is an excellent work!!

I've been trying to apply this to another different shape, training 4 images (640x480) with 180 landmarks each and default parameters of the train_shape_predictor_ex. Turned ON the SSE2 optimizations in the CMakeCache file and compiled in Release mode on Ubuntu.

It's been 3 hours and keeps saying "Fitting trees..."

I don't know what's wrong. I tried this shape before with fewer landmarks (68) and bigger images, and it seemed to work properly. But now, even with the optimizations, it hangs or something.

I was wondering if you have any suggestion to overcome this problem. I'm thinking about reducing the oversampling amount from 300 to 100 to see how it goes...

Thank you in advance.

Davis King said...

With just 4 images it shouldn't take more than a minute to run, if that. Did you run the example program unmodified or have you changed the code?

Karla Trejo said...

First I ran the example unmodified; it worked.
Then I changed the code a little bit: basically I just removed the interocular distance part, because I don't need it for this object, and loaded four random-size images with 68 landmarks; it worked.
Now I feed it four 640x480 images with 180 landmarks and it gets stuck...

Davis King said...

Try running it in CMake's Debug mode for a few minutes. That will turn on a lot of checks that may tell you what you did wrong. E.g. maybe your objects don't all have the same number of points in them.

Karla Trejo said...

Oh my god, that was it!! I messed up the numbering of a landmark, and one of the images had 179 instead of 180 landmarks. Thank you so much!! *--*

Everything works fine. I'm now adjusting the render part. I thought that only by doing this:

for (unsigned long i = 1; i <= 179; ++i)
lines.push_back(image_window::overlay_line(d.part(i), d.part(i-1), color));

would be sufficient as my object has a closed shape (all the landmarks connect sequentially and the last landmark connects with the first landmark).

But the drawing is not quite what I expected; at some points the lines cross... any suggestions about this?

Again, thank you VERY much!

Davis King said...

No problem.

That code is going to connect the dots sequentially with lines. If they cross then they cross. If that isn't what you expect then your labeling must be wrong in some way.

Karla Trejo said...

I'll check that out thoroughly then, thanks for everything Davis!!

You've been very helpful and I appreciate the detailed attention you give to us :)
Thank you for your time and patience.

Dlib is awesome!

Cheers,
Karla

Davis King said...

Thanks, no problem :)

Cheers,
Davis

SoundSilence said...

Hi, would you please let me know the memory usage as well as the model size of your methods? Thank you.

Eugene Zatepyakin said...

Can you please share the configuration you used for training the 68 landmark model?
I see that the number of cascades, as well as the trees and their depths, are different from the default settings.
It would be great to know your experience in choosing these settings, and also how the amount of padding affects results.
Thank you!

Davis King said...

The model file is about 100MB. Dlib comes with an example program that runs this algorithm, so you can run that program to see exactly what kind of computational resources it consumes.

As for training parameters, I believe I used the defaults except that I changed the cascade depth to 15. If you want insight into how the parameters affect training then I would suggest playing with the example program and reading the original paper, as it contains a detailed analysis of their effects.

jjshin said...

I downloaded your landmark detection program and it works well on a single image.

I assumed that a single image is given continuously.

Then I give the previous frame's shape to the current frame as the initial shape. (I added this function to the shape_predictor class.)

Then the shape in the first frame was good, but the shape gets crushed as time goes on, even though all the images are the same.

I think there's some strategies to solve this problem for the tracking in the video.

Do you have this kind of experience?

Davis King said...

Yes, doing that is definitely not going to work. If you want to track faces just run the face detector on each frame and look for overlapping boxes.

jjshin said...

Then, to make the video in this post, you detect the face in every frame and start from the mean shape on the detected bounding box. Am I right?

Davis King said...

Yes. It's just like in the example program.

Emory Xu said...

Hi, Davis.
I want to overlay a string on an image window (e.g. win); which function should I use?
Should I write win.add_overlay(string ..) or something like that?

Also, I want to combine two windows (e.g. win and winc) into one window; which function should I use?
After reading the dlib documentation I failed to find such functions. Could you help me?
thank you so much!!

Davis King said...

The documentation for image_window can be found here: http://dlib.net/dlib/gui_widgets/widgets_abstract.h.html#image_window

Emory Xu said...

Yes, Davis, I read the documentation for image_window, but I am so stupid and cannot find a function to display a string/text on the image window...
Could you do me a favor? Badly needed. Thanks so much!

Davis King said...

Look at this one


void add_overlay(
    const rectangle& r,
    pixel_type p,
    const std::string& l
);

Emory Xu said...

Yes! It works!
Davis, I cannot thank you more!!
BTW, is it possible to put a smaller image_window in one corner of another, bigger image_window?

Davis King said...

No. You have to build an image yourself that looks like that and give it to the image_window.

Chris Collins said...

Fantastic work. Do you have any suggestions for head roll, pitch and yaw?

Davis King said...

You can certainly calculate roll/pitch/yaw based on the positions of the landmarks. However, I don't have anything in dlib that does this calculation.

JonDadley said...

Hi Davis. Thanks for your fantastic work and continued support. As with Chris Collins' comment, I'm looking to calculate the yaw/pitch/roll based on the landmarks. Do you have any advice on how to go about this given that, as you say, dlib doesn't handle it? Any help you could give would be much appreciated.

Davis King said...

You should read about projective transformations. E.g. http://en.wikipedia.org/wiki/3D_projection, http://en.wikipedia.org/wiki/Projective_geometry