Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy to use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:




Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.
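To make that concrete, here is a minimal sketch of a same-person check using the Python API (assuming a dlib build recent enough to provide dlib.load_rgb_image and the 5-point landmark model; the image paths are placeholders):

import dlib

# Pretrained models, downloadable from dlib.net/files
detector = dlib.get_frontal_face_detector()
sp = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_descriptor(path):
    # Detect the first face, align it via the landmarks, and return the 128D vector.
    img = dlib.load_rgb_image(path)
    det = detector(img, 1)[0]
    shape = sp(img, det)
    return list(facerec.compute_face_descriptor(img, shape))

a = face_descriptor("face_a.jpg")   # placeholder paths
b = face_descriptor("face_b.jpg")
dist = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print("same person" if dist < 0.6 else "different people")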

For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of sources: the FaceScrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and FaceScrub. The total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib.  There is also a Python API for accessing the face recognition model.
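To illustrate the idea, here is a NumPy sketch of the kind of pair-wise hinge loss with mini-batch hard-negative mining described above. It is illustrative only, not dlib's actual loss_metric code; the 0.6 radius comes from the text, while the small margin value is just a plausible placeholder.

import numpy as np

def pairwise_hinge_loss(embeddings, labels, dist_thresh=0.6, margin=0.04):
    # embeddings: (N, 128) descriptors for one mini-batch; labels: (N,) identity ids.
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices(len(labels), k=1)          # count each pair once
    d, same = d[iu], same[iu]

    # Matching pairs should be closer than the threshold (minus a margin)...
    pos_loss = np.maximum(0.0, d[same] - (dist_thresh - margin))
    # ...and non-matching pairs farther than the threshold (plus a margin).
    neg_viol = np.maximum(0.0, (dist_thresh + margin) - d[~same])

    # Hard-negative mining at the mini-batch level: only the worst offending
    # negative pairs (as many as there are positive pairs) contribute.
    k = min(len(pos_loss), int(np.count_nonzero(neg_viol)))
    neg_loss = np.sort(neg_viol)[::-1][:k]

    n = max(1, len(pos_loss) + len(neg_loss))
    return (pos_loss.sum() + neg_loss.sum()) / n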



[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A data-driven approach to cleaning large face datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.

275 comments:

Suren Tamrazyan said...

Hi Davis,
I'm interested in your opinion: does it make sense to take the first few layers of a ResNet (or VGG, Inception), freeze their weights, add new layers, and train on a set of faces?

Tapas said...
This comment has been removed by the author.
Tapas said...

Hello Davis,
I got it working. I simply created a dlib.rectangle object, giving the image dimensions as constructor arguments, and passed it as the second argument to facerec.compute_face_descriptor. It is working.
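A minimal sketch of this approach with the Python API, assuming the image is already a tight face crop (file names are placeholders). In the standard API the rectangle is what the shape predictor takes, and the predictor's output is what goes to compute_face_descriptor:

import dlib

sp = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

img = dlib.load_rgb_image("already_cropped_face.jpg")   # placeholder path

# Treat the whole image as the face bounding box.
box = dlib.rectangle(0, 0, img.shape[1] - 1, img.shape[0] - 1)
shape = sp(img, box)
descriptor = facerec.compute_face_descriptor(img, shape)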

Thanks

Jumabek Alikhanov said...

Hi Davis,
Thanks for this cool stuff,
I wonder about the face descriptor computation time.
My machine (Core i7, SSD, GTX 1080 GPU) takes 0.35 sec to extract the descriptor for a single face, without any image jittering or augmentation.

Is that normal?
It seems too slow for real-time purposes.

Davis King said...

That's very slow. You are probably not using CUDA, BLAS, or any other such optimizations. When you compile, CMake will print messages telling you what it's doing, so you can check whether it's using these things.

miguel said...

Hi Davis,

First of all congratulations on your work! Really impressive, and dlib is for me one of the
greatest ML tools around.

I am trying to retrain your model for my type of images (it works very well, but I would still like to train on my own set). I have close to 500K identities; the problem is that I only have 1M images (two per subject). Do you think I can still get a good model without several images per subject?

Thanks in advance,
Miguel

Davis King said...

Thanks, I'm glad you like dlib :)

You can try with only 2 per subject, although I'm pretty sure the resulting model isn't going to be very good. The general consensus in the research community seems to be that you need a lot of within-class examples to learn this kind of model. That's been my experience as well.

miguel said...

I understand what you're saying, but I will need to give it a shot since my case is very specific. Another option might be transfer learning; I don't know exactly how to do it yet, but I will take a look.

Anyway, thank you a lot.

Cheers,
Miguel

AMG4ever said...

Hello Davis, I wonder, is it possible to "straighten" a detected face using dlib? Here's an example:
https://i.stack.imgur.com/4Y9HD.jpg

I only care about landmarks

Davis King said...

You can rotate them upright. But there isn't any 3D face warping in dlib if that's what you are asking about.

Siddhardha Saran said...
This comment has been removed by the author.
mehmet ali atici said...

instead of http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2

Can I use my own landmark detector .dat file, which was trained with dlib's shape predictor trainer but detects 90 points instead of 68?

mehmet ali atici said...
This comment has been removed by the author.
Davis King said...

You can do whatever you want so long as the faces are cropped and aligned the same way as the dlib example shows.
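For reference, a minimal sketch of that standard crop/alignment via the Python API (assuming a dlib build that exposes get_face_chip; file names and paths are placeholders). Note that dlib's built-in chip extraction expects the 5- or 68-point layouts, so a custom layout such as 90 points would need its own alignment code:

import dlib

detector = dlib.get_frontal_face_detector()
sp = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

img = dlib.load_rgb_image("face.jpg")            # placeholder path
det = detector(img, 1)[0]
shape = sp(img, det)

# This is the crop the recognition model expects: a 150x150 chip with 25% padding,
# rotated and scaled according to the landmarks.
aligned_chip = dlib.get_face_chip(img, shape, size=150, padding=0.25)

# compute_face_descriptor performs the same alignment internally from the shape.
descriptor = facerec.compute_face_descriptor(img, shape)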

Саша Таранов said...

This is no doubt an awesome post! Great job Davis King! And thank you!

Davis King said...

Thanks :)

Саша Таранов said...

Thank you Davis King! It works pretty well!

Саша Таранов said...

I ran the so-called t-SNE on the FaceScrub database (5 pictures per person) with the dlib face descriptor. Maybe it will be interesting to someone else; the final picture can be found here: https://drive.google.com/file/d/0B7JrJeplhKLveFgyRDlPNHBUR28/view?usp=sharing

Giorgos B. said...

Hello Davis,
I would like to train an object detector with 8 classes (dog, cat, and some other animals), but I want to run it on video input, so it should be as fast as possible. I've already tested train_object_detector.cpp, but it is really slow and decreases the video's frame rate (due to the high resolution). Which is the fastest detector I could use? Is there any particular solution you could propose?
Thanks a lot; I've been working with dlib for months now, and I still have many things to learn... :)

Davis King said...

The HOG detector in that example is the fastest one available. Also, http://dlib.net/faq.html#Whyisdlibslow

Bill Klein said...

I noticed in my tests that:

1) A face without a mouth visible got detected as a face

2) When comparing the descriptor of said mouthless face with the descriptor of a face of the same person with a mouth, we still get a very close distance. I.e. it correctly finds them to be the same person!

Am I right in assuming that this is because the mouth data is not used in the descriptor?

(I know that this is not a dlib-specific question and has more to do with the deep learning involved, but I'm not sure where else to ask this. Any tips for other forums?)

Thank you!

Davis King said...

No, the whole face crop is used in the computation, including the mouth. The point of this thing is to be robust to all kinds of changes to someone's face that still preserve their identity. So it's good that it works like this.

kim said...

Hi, first I have to say: great work, Dlib has helped me a lot.

Now for the question, is it possible to identify which image gave which face after the clustering?

Thanks :)

Davis King said...

Thanks, I'm glad you like dlib.

Yes, you can find that out. Look at the code. It's trivially available information in the example program.

Sobhan Mahdavi said...

Hi dear Davis, thank you very much for your works,
The mmod_human_face_detector is a great model. As you mentioned, the face detector used for preprocessing in face recognition is the HOG-based frontal face detector.
Can I use the DNN face detector model for face recognition and get the same performance?

Davis King said...

You can use any detector so long as you are able to align the faces the same way. To do this with the CNN model you need to use the 5-point face landmarking model released with the newest dlib. When using the 5-point model you can use either the CNN or HOG face detector and they will both give the same performance.
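For example, a minimal sketch of that combination with the Python API (the model file names are the standard dlib downloads; the image path is a placeholder):

import dlib

cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
sp5 = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

img = dlib.load_rgb_image("face.jpg")            # placeholder path
for det in cnn_detector(img, 1):
    # The CNN detector returns mmod_rectangles; the plain box is in .rect
    shape = sp5(img, det.rect)
    descriptor = facerec.compute_face_descriptor(img, shape)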

Sobhan Mahdavi said...

Dear Davis, Thanks for your quick reply
I used the CNN model with the 5-point face landmarking model, but I get an error:

Error detected in file c:\dlib\dlib\image_transforms/interpolation.h.
Error detected in function struct dlib::chip_details __cdecl dlib::get_face_chip_details(const class dlib::full_object_detection &,const unsigned long,const double).

Failing expression was det.num_parts() == 68.
chip_details get_face_chip_details()
You must give a detection with exactly 68 parts in it.
det.num_parts(): 5

The code is here:
auto shape = sp(img, det);
matrix<rgb_pixel> face_chip;
extract_image_chip(img, get_face_chip_details(shape, 150, 0.25), face_chip);
face = move(face_chip);
matrix<float,0,1> face_descriptor = net(face);

Can you help me?

Davis King said...

Think about it. You are trying to use a model that wasn't created until dlib 19.7, but you are using an older version of dlib. How can that work?

Mayur Patel said...

Hello Davis
I have installed CUDA and then compiled dlib and OpenCV; it's all working. I want to know if the "dlib.face_recognition_model_v1.compute_face_descriptor" function utilizes CUDA. If not, do I have to write a wrapper for Python, or something like that? I ask because I want to compare performance before and after installing CUDA.
Thank you.

Davis King said...

dlib.face_recognition_model_v1.compute_face_descriptor uses CUDA.

Mayur Patel said...

But there is no speed difference before and after compiling with CUDA. Does that mean CUDA makes no difference in this function?

Davis King said...

Look at the CMake output when dlib is built. It will tell you if it's using CUDA or not.

Duc Vo said...

hi Davis,

I found that this line of code: std::vector<matrix<float,0,1>> face_descriptors = net(faces); takes the most time. Each face image takes around 300 ms to convert into a face descriptor. Any chance to reduce that?

Thanks.

Davis King said...

Be sure to link to the Intel MKL if running on the CPU, or even better, use a fast GPU.

Duc Vo said...

Yeah, I compiled with CUDA and it runs faster now. Thanks :)

Mayur Patel said...
This comment has been removed by the author.
Mayur Patel said...

Hi Duc, how much time does it take now after compiling with CUDA? Mine is 170 ms both before and after CUDA, but in both cases with Intel MKL. How long does yours take now?

Duc Vo said...

This line of code

std::vector<matrix<float,0,1>> face_descriptors = net(faces);

takes 0.44 seconds on 17 faces, so it is around 26 ms per face. My GPU is an Nvidia GTX 870M.

Cheers,

Mayur Patel said...

Which compiler did you use? Is your platform Windows? And can you tell me how you compiled with CUDA?

Duc Vo said...

I compiled on Linux.

The standard way to compile is to use CMake, as recommended by Davis. I first run cmake-gui to enable DLIB_USE_CUDA and other options like DLIB_JPEG_SUPPORT, DLIB_PNG_SUPPORT, etc. After that, just run the following commands (see http://dlib.net/compile.html):

cd examples
mkdir build
cmake-gui ( this is when you enable CUDA as mentioned above )
cd build
cmake ..
cmake --build . --config Release

Good luck.

P.S. You said it takes 170 ms; is that per face or for the whole vector of faces?

Mayur Patel said...
This comment has been removed by the author.
Mayur Patel said...
This comment has been removed by the author.
Ashok Bugude said...

Hi Thanks for the great work

Can I please know if there is any way to get the name of the person for each clustered group?

Basically I want to train on, say, 5 sets of people and recognize them in an image.

Tapas said...

Hello Davis,
Thanks for such a nice library.
I am facing a problem, in continuation of my comment posted on August 24, 2017 at 3:21 AM.

Once I get the face bounding box from a video frame, I extract the 128D features from the bounding box. Since I have the bounding box of the detected face, I also save the face crop to disk. When I reload the saved face and extract the 128D features again (using a dlib.rectangle over the whole face image), these features do not match the features from the original bounding box. Why are the features not matching?

Thanks
Tapas

Bill Klein said...

I'm wondering if anyone has experimented to determine the minimum face resolution that will result in a reliable computation of the descriptor. For example, will a 40x40-pixel face be comparable to faces of higher resolution?

mehmet ali atici said...

Hi Davis;
In the Python example, what is the role of shape in the line ... compute_face_descriptor(img, shape)? Does the recognition model calculate the descriptor according to the five landmarks? If so, is it possible to use another shape model, for example one that finds 10 landmarks? In that case, does the distance threshold (0.6) change?

thanks in advance.

Davis King said...

The landmarks are only used to align the face before the DNN extracts the face descriptor. How many landmarks you use doesn't really matter.

mehmet ali atici said...

Hi Davis,

It seems that the size of the output layer of the network is 128, which corresponds to the 128D vector. How is the 128D vector extracted from an image for training?

Davis King said...

This example program shows how to train the model from images: http://dlib.net/dnn_metric_learning_on_images_ex.cpp.html

Jon Hauris said...

Hello Davis, how do I determine which image the "image index" refers to? Specifically:
I am training my own detector and received the following:
"RuntimeError: An impossible set of object labels was detected ..."
1. It said that the problem was "image index 1017". How do I find which image this refers to in the XML file?
2. It also gives the "truth rectangle" and "nearest detection template rect:" with their bounding box parameters, none of which match any of my bounding boxes. What are these rectangles referring to?
3. Where do I adjust match_eps?
Thank you, Jon

mehmet ali atici said...

Hello Davis;
Do you plan to provide a Python API for DNN metric learning?

Thanks.

Davis King said...

Like this? https://github.com/davisking/dlib/blob/master/python_examples/face_recognition.py

mehmet ali atici said...

No, I mean a Python equivalent of http://dlib.net/dnn_metric_learning_ex.cpp.html

Davis King said...

No, I'm not going to add that since it's impossible to define the network architecture from python.

Mike said...

Hi Davis,
have you ever tried it on a Jetson TX2? I wonder how fast it would be.
Is there any chance to optimize it on an ARM Cortex-A7 (dual core) to reach approx. 3-5 fps?
Or would you rather say forget it?
Thanks!

Davis King said...

I haven't used a Jetson, but that's a very popular way to run these DNNs. Most people find the performance to be quite reasonable.

Bill Klein said...

Mike, I had posted a question about this in the TX2 forum after I did a bit of testing:

https://devtalk.nvidia.com/default/topic/1025670/jetson-tx2/dnn-face-detection-performance-on-tx2/post/5217046/

At least for my particular test (dlib DNN-based face detection on high-res images), it appears that the Jetson TX2 is ~10x slower than a GTX 1070. Please do let us know what you find. :)

Mike said...

Bill, thanks for your feedback. I have measured the execution time for the dnn_face_recognition_ex on a Jetson TX2.
It is a Release compile (though with debug info) using CUDA.
The time is exclusive of loading the data and displaying the images, just the inner execution: 4204 ms.
Is that in line with your findings?
I will get rid of the debug info and play with compiler settings.

Bill Klein said...

Hey Mike,

Firstly, I haven't executed the dnn_face_recognition_ex example specifically. Sorry. Secondly, be sure to do a few iterations of whatever you are trying since the first few (?) may be much slower than the others, due to things being initialized the first time around...

Mike said...

Hi Bill,
I do run the example several times, but it does not get any better than 4.2 seconds. It detects 24 faces in total from 4 different guys, so my take is 5-6 fps. Any hints on compiler options to check?
Next, I will compare this result with a standard dual core ARMv7@1GHz, eventually using NEON and VFP support...

Mike said...

Hi Bill,
The same code on an ARM Cortex-A7 @ 1 GHz takes 130 seconds. It is a release compile with -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -ffast-math -Ofast ...

Bill Klein said...

Sounds like a decent speed-up. :)

Mike said...

Hello Davis,
Am I right in assuming that dnn_face_recognition_ex as of v19.7 uses the HOG-based frontal face detector and NOT the CNN-based one? So I should be able to expect performance improvements for the landmark extraction by employing NEON?

Davis King said...

Yes, that example uses HOG. You could just as easily use the CNN face detector with it though.

Tsai Joy said...

Dear Davis:
Recently I've been wanting to insert an OpenMP "#pragma omp parallel for"
into dlib\matrix\matrix_default_mul.h,
specifically like this:

for (long r = lhs_block.top(); r <= lhs_block.bottom(); ++r)
{
    for (long c = lhs_block.left(); c <= lhs_block.right(); ++c)
    {
        const typename EXP2::type temp = lhs(r,c);
        #pragma omp parallel for // <---------------- inserted OpenMP line here
        for (long i = rhs_block.left(); i <= rhs_block.right(); ++i)
        {
            dest(r,i) += rhs(c,i)*temp;
        }
    }
}

However, when I run examples such as dnn_face_recognition_ex.cpp,
I don't see multi-core processing (via Intel's VTune tool) when the "get 128D" line runs:
std::vector<matrix<float,0,1>> face_descriptors = net(faces);

Where should I put the "#pragma omp parallel for" to enable OpenMP multi-core processing in dlib?

Davis King said...

Linking dlib with the Intel MKL is a much better approach to get the kind of speed boost you are looking for. The CMake scripts are already setup to do it if you install the MKL.

Mike said...

Hi Davis,
Are the dlib-for-ARM improvements by fastfastball (e.g. SIMD for NEON and threading) now part of dlib 19.7?
Or would I have to redo the changes for 19.7?

Davis King said...

All that stuff is now part of the main dlib codebase, so yes, it's there. You don't need to do anything to get it.

Kevin Tian said...

Hi Davis,

Thank you for your great work! It really helps me a lot in my project.

Regarding the accuracy rate of 99.38% on LFW, do you only test on the 1680 people pictured with more than one photo? The LFW page says there are some incorrectly labeled photos. How do you handle these photos: manually correct them or ignore them in your test?

How do you do the recognition testing? Do you calculate every photo's 128D feature, compare them with each other, and check whether the distance between photos of the same person is less than 0.6?

Thank you in advance!

Kevin

Davis King said...

I follow the exact evaluation protocol laid out by the LFW challenge. This file contains the entire test script for the dlib model: http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2. You can run it and see the LFW evaluation outputs.

Kevin Tian said...

Hi Davis,

Thank you for your reply!

I have a question about training face model. Could you give me some comments?

If I increase the number of face images from 3 million to 6 million, will the trained model verify faces better? For example, will the accuracy rate increase and the false positive rate decrease?

In your experience, is there an upper bound on recognition capability, i.e. a point beyond which it will not increase even though the number of training face images increases?

Best regards,
Kevin

Davis King said...

Yes, more data is better. For instance, Google trained a face recognizer on 200 million faces and got great results.

Yury Savitskiy said...

Hello Davis,

Could you please explain how you form a mini-batch?

Do you take some number of different persons and some number of their unique images? For example, do you choose 64 persons and take 8 images for each, so the size of the mini-batch is 512? Or do you just take some random images, so that for one person you have 10 images, for a second 3, and so on?

Yury Savitskiy

Davis King said...

You can make the mini-batches any way you want. To see what I did, refer to the metric learning example program.
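For reference, a rough sketch of the balanced-sampling idea in Python (illustrative only; dlib's example builds its mini-batches in C++, and the people-per-batch and images-per-person numbers below are just placeholders):

import random

def make_mini_batch(images_by_identity, people_per_batch=32, imgs_per_person=4):
    # Sample a fixed number of identities, then a fixed number of images from each,
    # so every mini-batch contains both matching and non-matching pairs.
    people = random.sample(list(images_by_identity), people_per_batch)
    batch, labels = [], []
    for label, person in enumerate(people):
        imgs = images_by_identity[person]
        picks = random.choices(imgs, k=imgs_per_person)  # with replacement if a person has few images
        batch.extend(picks)
        labels.extend([label] * imgs_per_person)
    return batch, labels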
