Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy-to-use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:

Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.
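To make the accuracy figure concrete, here is a minimal sketch of how pairwise verification accuracy can be computed from face descriptors. The function names and the toy 2D descriptors are hypothetical, and the 0.6 distance threshold is an assumption borrowed from the ball radius mentioned in the training description; the real descriptors are 128-dimensional and the real evaluation uses the LFW pairs.

```python
import math

def is_same_person(desc_a, desc_b, threshold=0.6):
    """Predict 'same person' when two descriptors are closer than the threshold.

    The 0.6 default is an assumption taken from the post's training description.
    """
    return math.dist(desc_a, desc_b) < threshold

def verification_accuracy(labeled_pairs, threshold=0.6):
    """Fraction of (desc_a, desc_b, is_same) pairs that are predicted correctly."""
    correct = sum(
        is_same_person(a, b, threshold) == is_same
        for a, b, is_same in labeled_pairs
    )
    return correct / len(labeled_pairs)

# Toy 2D descriptors standing in for real 128D face descriptors.
toy_pairs = [
    ([0.0, 0.0], [0.1, 0.2], True),   # close together, labeled same
    ([0.0, 0.0], [1.0, 1.0], False),  # far apart, labeled different
    ([0.0, 0.0], [0.5, 0.5], True),   # farther than 0.6 but labeled same
    ([0.3, 0.0], [0.0, 0.4], False),  # closer than 0.6 but labeled different
]
print(verification_accuracy(toy_pairs))
```

The last two toy pairs are deliberately misclassified, which shows that the accuracy counts both kinds of error: missed matches and false matches.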

For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of datasets: the FaceScrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and FaceScrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib.  There is also a Python API for accessing the face recognition model.
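As a rough illustration of the loss described above, the following sketch computes a pairwise hinge loss over all pairs in a mini-batch. It is not the dlib implementation: the function name, the optional margin parameter, and the mining rule (keep only the hardest different-identity pairs, at most as many as there are same-identity pairs) are assumptions made for the example. Consult the dlib documentation for the exact definition.

```python
import math
from itertools import combinations

def pairwise_hinge_loss(embeddings, labels, threshold=0.6, margin=0.0):
    """Average hinge loss over mini-batch pairs (illustrative sketch only).

    threshold: distance that separates same from different identities (0.6 in the post).
    margin:    hypothetical extra slack around the threshold; 0.0 disables it.
    """
    pos_losses, neg_losses = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = math.dist(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            # Same identity: penalize pairs that are farther apart than allowed.
            pos_losses.append(max(0.0, d - (threshold - margin)))
        else:
            # Different identities: penalize pairs that are too close together.
            neg_losses.append(max(0.0, (threshold + margin) - d))

    # Assumed hard-negative mining rule: keep the highest-loss negatives,
    # capped at the number of positive pairs (at least one).
    neg_losses.sort(reverse=True)
    hard_negatives = neg_losses[:max(1, len(pos_losses))]

    used = pos_losses + hard_negatives
    return sum(used) / len(used) if used else 0.0

# Toy 2D embeddings; identities 0 and 1 each appear twice.
batch = [[0.0, 0.0], [0.3, 0.0], [2.0, 0.0], [0.5, 0.0]]
ids = [0, 0, 1, 1]
print(pairwise_hinge_loss(batch, ids))
```

In this toy batch only the two hardest negatives contribute, so the many easy negatives (pairs that are already far apart) do not dilute the gradient signal from the difficult ones.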

[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A data-driven approach to cleaning large face datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.


Tsai Joy said...

Hi Andrey, yes it's Euclidean score not euler, my bad ;)
(Euclidean score = 1.0 - Euclidean distance)

As for the second question, yes, there were uncertain regions (Euclidean score 60~70) where the face recognition (FR) had trouble giving correct results. I previously skipped these frames completely, since the input is a video and I have many frames of face images to use.

In short, a Euclidean score below 60 was treated as "Unconfidence", and the inferred face was labeled "Unknown" (Unknown means not enrolled in the database). In contrast, a Euclidean score above 70 was labeled with the enrolled name in the database, e.g. "WILL". For Euclidean scores between 60 and 70, false results occur, i.e. the FR thinks it's someone else ("John") when it's actually "WILL".

But all these measures are just a workaround for the Euclidean score, as the SVM score can now give a more definite "-1" for different faces and "+1" for same faces between the inferred and enrolled faces. I guess you could say the high-variance SVM score (which I normalized from -1~1 to 0.00~1.00) is better for FR applications, because it separates the "higher known confidence" from the "lower unknown Unconfidence" more clearly.
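The score bands described in this comment can be sketched roughly as follows. The function name is hypothetical, scores are expressed as fractions (0.60 and 0.70) rather than percentages, and treating scores exactly on a boundary as uncertain is an assumption.

```python
def label_frame(distance, best_match_name, low=0.60, high=0.70):
    """Map a Euclidean distance to a label using the score bands from the comment.

    Returns the enrolled name, "Unknown", or None when the frame should be skipped.
    """
    score = 1.0 - distance  # Euclidean score = 1.0 - Euclidean distance
    if score < low:
        return "Unknown"        # not enrolled in the database
    if score > high:
        return best_match_name  # confident match with an enrolled face
    return None                 # uncertain band: skip this video frame

print(label_frame(0.20, "WILL"))  # score 0.80, confident match
print(label_frame(0.50, "WILL"))  # score 0.50, labeled Unknown
print(label_frame(0.35, "WILL"))  # score 0.65, frame skipped
```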

Andrey Zakharoff said...

@Tsai Joy Hi Joy, I use the formula Probability = sqrt(1. - Euclidean distance). This is not a real probability, but a probability-like value. In this way I de-linearize the output value. Because of the changed slope of the function, the scores are stretched near 0 and compressed where the Euclidean distance approaches 1. Don't you think your SVM does something like this?

Andrey Zakharoff said...

Please do not SPAM here!

Kasper van Zon said...

Hi Davis,

Thank you for providing this great library!
I would like to speed up the inference step of the network on the CPU by using Intel's OpenVino framework. To do this I would first need to convert your network into a format that the OpenVino model optimizer understands (e.g. Caffe). I have tried to use your "convert_dlib_nets_to_caffe" tool, but I ran into a problem.
The conversion from .dat to .xml did work, but when I ran the tool (./convert_dlib_nets_to_caffe DlibFaceNet.xml 1 3 150 150) it gave the following error:
*************** ERROR CONVERTING TO CAFFE ***************
No conversion between dlib pooling layer parameters and caffe pooling layer parameters found for layer 127
dlib_output_nc: 35
bottom_nc: 72
padding_x: 0
stride_x: 2
kernel_w: 3
pad_x: 1

Any suggestions on how to fix this? It would be great if we could make your network run on the new Intel Movidius Neural Compute Stick 2.


Bill Klein said...

I'm attempting to use dnn_metric_learning_on_images_ex on multiple (4) GPUs for the first time. After much playing with the batch size and the number of data_loader threads, I can't seem to get the typical GPU usage to above ~30%. Any suggestions of what I should be looking at / modifying to keep the GPUs occupied? Thanks!

Davis King said...

The network in that example is probably too small to benefit from 4 GPUs.

Mike said...

Hi Davis,
you recommended the use of an SVM classifier after the DNN ("I would use").

The stated performance of 99.38% on the standard Labeled Faces in the Wild benchmark is achieved using the pure DNN with Euclidean distance measure and without any VCM behind it, correct?

Thanks, Michael

Mike said...

Sorry, I mean ...without any "SVM" behind it ...

Davis King said...

Right, the 99.38% accuracy is without any additional training applied. It's using just the DNN model by itself.

Mike said...

Hi Davis,
do you publish the code used for the Labeled Faces in the Wild benchmark in order for us to duplicate the result?

Davis King said...

The LFW test script is here:

Sara said...

Hello Davis,

Thanks for your great work.
My dataset has 3M people with 60 images per person. Do I need to change the following numbers in the code? Do you think the numbers seem reasonable?
Are these numbers related to the batch size (which is 128 by default)?
const unsigned BATCH_NUM_SAMPLES = 40


Davis King said...

I have no idea what:

const unsigned BATCH_NUM_SAMPLES = 40

is referring to in your code.

Sara said...

I am running "dnn_metric_learning_on_images_ex.cpp" file in dlib/examples.
Sorry, maybe I am using an older version. Those variables were defined in load_mini_batch class. I mean:

num_people = 64
samples_per_id = 40

Davis King said...

Those numbers are fine, the bigger the better usually, as long as your hardware has enough RAM to support such sizes. You should run experiments to see what works best though.
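For readers wondering how these two numbers shape a mini-batch, here is a minimal sketch of one way to build a batch of num_people identities with samples_per_id images each. The function name, the dictionary layout, and the rule of sampling with replacement when a person has too few images are assumptions made for illustration, not a copy of the dlib example.

```python
import random

def sample_mini_batch(images_by_person, num_people, samples_per_id, rng):
    """Pick num_people distinct identities and samples_per_id images for each.

    images_by_person: dict mapping a person id to a list of image names (hypothetical layout).
    Returns (images, labels) with num_people * samples_per_id entries each.
    """
    person_ids = rng.sample(sorted(images_by_person), num_people)
    images, labels = [], []
    for person_id in person_ids:
        pool = images_by_person[person_id]
        if len(pool) >= samples_per_id:
            # Enough images: take distinct ones.
            chosen = rng.sample(pool, samples_per_id)
        else:
            # Too few images: sample with replacement (assumed fallback).
            chosen = [rng.choice(pool) for _ in range(samples_per_id)]
        images.extend(chosen)
        labels.extend([person_id] * samples_per_id)
    return images, labels

dataset = {"a": ["a1", "a2", "a3"], "b": ["b1"], "c": ["c1", "c2"]}
images, labels = sample_mini_batch(dataset, num_people=2, samples_per_id=2, rng=random.Random(0))
print(len(images), len(set(labels)))
```

With this layout the effective batch size is num_people * samples_per_id, so 64 people with 40 samples each means 2560 images per mini-batch, which is why available RAM becomes the limiting factor.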

WilliamCorrea said...

Hello Davis.
I'm training my own face recognition model, testing different architectures and loss functions, and comparing them with pre-built models like yours. I came into the comment of your loss function << a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level >>, and I wonder what do you think of it when compared with the triplet-loss (or more recent ones such as, and if you had the time to empirically compare them. Thanks !

Davis King said...

I think it makes more sense than the triplet loss. I'm not sure how it compares to the other recent losses. I think sphereloss is sensible and basically motivated the same way. Really though you should optimize a loss that measures the performance of the model on the task you want to accomplish. The loss I used here in dlib optimizes the binary classification accuracy when using Euclidean distance between vectors to decide if pairs of faces are the same person. Other losses will be more appropriate for other tasks.

It should also be noted though that the quality and size of your training dataset is far and away the most important variable in making a good face recognition model. All your effort should be on that. Other things are micro optimizations.

KBN said...


I did this ->
deserialize("mydatafile.dat") >> learned_pfunct_load_from_modelII;

It worked well.

I ported to a custom platform and could only access mydatafile.dat from the memory location. So how do I deserialize from memory? I tried
char * mydatafileloadedfrommemory; // points to my memory data location


It does not work.

Any pointer please?

Davis King said...

You can also always call deserialize(some_object, some_input_stream). So you just have to make an input stream with the data in it. There are many ways to do this.

KBN said...

Thanks Davis.

I did the following -

std::istream in(&sbuf); // where sbuf is streambuf.

So, how do I get this 'some_object'?

deserialize(some_object, in); // <-- where and how do I define this 'some_object'?

Any pointer is much appreciated.

WilliamCorrea said...

Hello Davis,
Thank you for your last explanation about the loss used by dlib. For the sake of completeness, I would have some other questions.

Regarding the creation of the mini-batches at training time, do you specify a minimum number of images per class (person)? For instance, if the size of the mini-batch is 32 and I want a minimum of 4 images per person, would I have at most 8 different persons in that mini-batch?

Regarding the loss, with a ball-radius of 0.6, would it be something like this the way you computed it ?:

- For two embedding of the same person
difference = np.array(embedding_1) - np.array(embedding_2)
norm = np.linalg.norm(difference, 2)
loss = max(0, norm - 0.6)

- For two embeddings of different persons:
difference = np.array(embedding_1) - np.array(embedding_3)
norm = np.linalg.norm(difference, 2)
loss = max(0, 0.6 - norm)

Thank you very much !

Davis King said...

I don't remember the exact details. I would have to look at the documentation to answer your question. So you might as well just skip that step and look at the documentation yourself ;)

Kim Uyen Nguyen Thi said...

Hello Davis,

First, thank you very much for this amazing library.

I am fine-tuning the model "dlib_face_recognition_resnet_model_v1.dat" based on your code from file "dnn_metric_learning_on_images_ex.cpp"

I just tried to fine-tune with your data "examples/johns" and got the error "NAN" with average loss. Like following picture:

Could you please give me some advice about this problem?
Thank you very much.

Kim Uyen Nguyen Thi said...

Hello Davis,
I found your comment on another post.

Thank you so much, it's really helpful for me.

Unknown said...

What is the best way to represent the 128D face vec (std::vector>) so that it can be stored in an SQL database?

And I wonder if it is possible to run the face ResNet on CPU only, without GPU acceleration?
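One possible way to store a 128D descriptor in an SQL database is to pack it into a fixed-size binary blob. The sketch below uses Python's sqlite3 and struct modules; the table name, column names, and the little-endian float32 layout are assumptions chosen for the example, not something prescribed by dlib.

```python
import sqlite3
import struct

DESCRIPTOR_DIM = 128
PACK_FORMAT = f"<{DESCRIPTOR_DIM}f"  # 128 little-endian 32-bit floats (assumed layout)

def pack_descriptor(values):
    """Serialize a 128D descriptor into a 512-byte blob."""
    return struct.pack(PACK_FORMAT, *values)

def unpack_descriptor(blob):
    """Restore the list of 128 floats from a blob."""
    return list(struct.unpack(PACK_FORMAT, blob))

# In-memory database and hypothetical table layout for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faces (name TEXT PRIMARY KEY, descriptor BLOB NOT NULL)")

# Placeholder descriptor; a real one would come from the face recognition model.
descriptor = [i / 128.0 for i in range(DESCRIPTOR_DIM)]
conn.execute("INSERT INTO faces VALUES (?, ?)", ("WILL", pack_descriptor(descriptor)))

(blob,) = conn.execute(
    "SELECT descriptor FROM faces WHERE name = ?", ("WILL",)
).fetchone()
restored = unpack_descriptor(blob)
print(len(blob), len(restored))
```

A blob keeps each descriptor compact and quick to load, at the cost of not being queryable by element inside SQL; distance comparisons would then happen in application code after loading.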


Unknown said...

There is a problem for me:
Error C1128: The number of sections has exceeded the object file format limit: compile with /bigobj. ook1 C:\Users\M-Team\Desktop\ook1\ook1\ook1.cpp
what should i do?

Unknown said...

There is a problem for me:
The number of sections has exceeded the formatting limit of the object file

what should i do?

clemw said...

I've looked at the XML output of the part labels (from file examples/faces/training_with_face_landmarks.xml) and it seems to label parts with just numbers instead of names of parts:

<image file='2007_007763.jpg'>
<box top='90' left='194' width='37' height='37'>
<part name='00' x='201' y='107'/>
<part name='01' x='201' y='110'/>
<part name='02' x='201' y='113'/>
<part name='03' x='202' y='117'/>

I presume this means you have to be consistent with the ordering?

And what if you are trying to do a multi-nominal image matcher, for things that have different types of interior parts? Then they would have the same numbers.

(Sorry the question is dumb... I'm just starting with image recognition. This is a pretty neat library!)

Davis King said...

The labels can be anything, they don't have to be numbers. But you have to be consistent about their meaning. Like you can't just randomly shuffle the labels on each training sample, obviously.

Things with different types of parts need different models.
