Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy-to-use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:

[Figure: the input images and the four automatically identified face clusters]
Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.

For those interested in the model details, this model is a ResNet with 29 convolutional layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun, with a few layers removed and the number of filters per layer reduced by half.
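To make that concrete, here is a condensed sketch of the network definition, mirroring the one in dlib's dnn_face_recognition_ex.cpp (treat the example program as the authoritative version):

#include <dlib/dnn.h>
using namespace dlib;

// Building blocks for the residual network.
template <template <int,template<typename>class,int,typename> class block, int N, template<typename>class BN, typename SUBNET>
using residual = add_prev1<block<N,BN,1,tag1<SUBNET>>>;

template <template <int,template<typename>class,int,typename> class block, int N, template<typename>class BN, typename SUBNET>
using residual_down = add_prev2<avg_pool<2,2,2,2,skip1<tag2<block<N,BN,2,tag1<SUBNET>>>>>>;

template <int N, template <typename> class BN, int stride, typename SUBNET>
using block = BN<con<N,3,3,1,1,relu<BN<con<N,3,3,stride,stride,SUBNET>>>>>;

// "a" variants use affine layers since this is the inference-time definition.
template <int N, typename SUBNET> using ares      = relu<residual<block,N,affine,SUBNET>>;
template <int N, typename SUBNET> using ares_down = relu<residual_down<block,N,affine,SUBNET>>;

template <typename SUBNET> using alevel0 = ares_down<256,SUBNET>;
template <typename SUBNET> using alevel1 = ares<256,ares<256,ares_down<256,SUBNET>>>;
template <typename SUBNET> using alevel2 = ares<128,ares<128,ares_down<128,SUBNET>>>;
template <typename SUBNET> using alevel3 = ares<64,ares<64,ares<64,ares_down<64,SUBNET>>>>;
template <typename SUBNET> using alevel4 = ares<32,ares<32,ares<32,SUBNET>>>;

// 150x150 RGB input in, 128D embedding out, trained with the metric loss.
using anet_type = loss_metric<fc_no_bias<128,avg_pool_everything<
                            alevel0<alevel1<alevel2<alevel3<alevel4<
                            max_pool<3,3,2,2,relu<affine<con<32,7,7,2,2,
                            input_rgb_image_sized<150>
                            >>>>>>>>>>>>;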

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of sources: the face scrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. The total number of individual identities in the dataset is 7485, and I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pairwise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all the details on training and model specifics by reading the example program and consulting the referenced parts of dlib. There is also a Python API for accessing the face recognition model.
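As a quick illustration of the resulting workflow, the following condensed sketch (based on dnn_face_recognition_ex.cpp, with error handling omitted and a placeholder image name) computes one 128D descriptor per detected face, treats a descriptor distance under 0.6 as "same person", and clusters the faces the way the example does. It assumes the anet_type definition above:

#include <dlib/dnn.h>
#include <dlib/clustering.h>
#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/image_io.h>
#include <iostream>
using namespace dlib;

int main()
{
    // Face detector, 5-point landmarker for alignment, and the metric net.
    frontal_face_detector detector = get_frontal_face_detector();
    shape_predictor sp;
    deserialize("shape_predictor_5_face_landmarks.dat") >> sp;
    anet_type net;
    deserialize("dlib_face_recognition_resnet_model_v1.dat") >> net;

    matrix<rgb_pixel> img;
    load_image(img, "faces.jpg");  // placeholder file name

    // Detect faces and extract aligned 150x150 face chips.
    std::vector<matrix<rgb_pixel>> faces;
    for (auto face : detector(img))
    {
        auto shape = sp(img, face);
        matrix<rgb_pixel> face_chip;
        extract_image_chip(img, get_face_chip_details(shape,150,0.25), face_chip);
        faces.push_back(std::move(face_chip));
    }

    // One 128D descriptor per face. Two faces are "the same person"
    // when the Euclidean distance between descriptors is < 0.6.
    std::vector<matrix<float,0,1>> face_descriptors = net(faces);

    // Link all pairs closer than 0.6 and run chinese_whispers graph
    // clustering to find the number of distinct people.
    std::vector<sample_pair> edges;
    for (size_t i = 0; i < face_descriptors.size(); ++i)
        for (size_t j = i; j < face_descriptors.size(); ++j)
            if (length(face_descriptors[i]-face_descriptors[j]) < 0.6)
                edges.push_back(sample_pair(i,j));
    std::vector<unsigned long> labels;
    const auto num_clusters = chinese_whispers(edges, labels);
    std::cout << "number of people found: " << num_clusters << std::endl;
}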



[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A Data-Driven Approach to Cleaning Large Face Datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.

415 comments:

Tsai Joy said...

Hi Andrey, yes, it's "Euclidean score", not Euler. My bad ;)
(Euclidean score = 1.0 - Euclidean distance)

As for the second question, yes, there were not-sure regions (Euclidean score 60~70) where the face recognition (FR) had trouble giving correct results. Therefore, I previously skipped those frames completely, since the inference runs on a video and I have many other frames of face images to use.

In short, a Euclidean score below 60 was treated as unconfident, and the inference face is labeled "Unknown" (meaning not enrolled in the database). In contrast, a Euclidean score above 70 is labeled with the enrolled name in the database, e.g. "WILL". For Euclidean scores between 60 and 70, false results can occur, i.e. the FR will think it's someone else ("John") when it's actually "WILL".
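(A minimal sketch of that banding logic, assuming the score is scaled to 0~100; the names and thresholds are just the ones from my description:)

// Hypothetical sketch of the score banding described above.
// Assumes score = (1.0 - Euclidean distance) * 100.
#include <string>

std::string label_face(double score, const std::string& enrolled_name)
{
    if (score < 60) return "Unknown";      // not enrolled / low confidence
    if (score > 70) return enrolled_name;  // confident match, e.g. "WILL"
    return "skip";                         // 60~70: unreliable, skip the frame
}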

But all these measures are just a workaround for the Euclidean score, as the SVM score can now give a more definite "-1" for different faces and "+1" for the same face between the inference image and the enrolled one. I guess you can say the higher-variance SVM score (I normalized it from -1~1 to 0.00~1.00) is better for an FR application, since it separates the "higher known confidence" from the "lower unknown unconfidence".

Andrey Zakharoff said...

@Tsai Joy Hi Joy, I use the formula Probability = sqrt(1 - Euclidean distance); this is also not a real probability, but a probability-like value. Actually, in this way I de-linearize the output value. Due to the changed slope of the function, I get scores that are stretched near 0 and shrunk as the Euclidean distance approaches 1. Don't you think your SVM does something like this?
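(A quick numeric sketch comparing the two mappings, illustrative values only:)

// Compare the linear score (1 - d) with the de-linearized sqrt(1 - d)
// at a few sample Euclidean distances.
#include <cmath>
#include <cstdio>

int main()
{
    for (double d : {0.0, 0.3, 0.6, 0.9, 0.99})
        std::printf("d=%.2f  linear=%.3f  sqrt=%.3f\n",
                    d, 1.0 - d, std::sqrt(1.0 - d));
}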

Andrey Zakharoff said...

Please do not SPAM here!

Kasper van Zon said...

Hi Davis,

Thank you for providing this great library!
I would like to speed up the inference step of the network on the CPU by using Intel's OpenVINO framework. To do this I would first need to convert your network into a format that the OpenVINO model optimizer understands (e.g. Caffe). I have tried to use your "convert_dlib_nets_to_caffe" tool, but I ran into a problem.
The conversion from .dat to .xml did work, but when I ran the tool (./convert_dlib_nets_to_caffe DlibFaceNet.xml 1 3 150 150) it gave the following error:
*************** ERROR CONVERTING TO CAFFE ***************
No conversion between dlib pooling layer parameters and caffe pooling layer parameters found for layer 127
dlib_output_nc: 35
bottom_nc: 72
padding_x: 0
stride_x: 2
kernel_w: 3
pad_x: 1

Any suggestions on how to fix this? It would be great if we could make your network run on the new Intel Movidius Neural Compute Stick 2.

Cheers

Bill Klein said...

I'm attempting to use dnn_metric_learning_on_images_ex on multiple (4) GPUs for the first time. After much playing with the batch size and the number of data_loader threads, I can't seem to get typical GPU usage above ~30%. Any suggestions for what I should be looking at or modifying to keep the GPUs occupied? Thanks!

Davis King said...

The network in that example is probably too small to benefit from 4 GPUs.
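In case it helps, this is how extra GPUs get handed to the trainer (a sketch along the lines of dlib's dnn_imagenet_train_ex.cpp, assuming net and net_type are already defined):

// Run the solver on GPUs 0-3; each mini-batch is split across them.
dlib::dnn_trainer<net_type> trainer(net, dlib::sgd(0.0001, 0.9), {0,1,2,3});
trainer.be_verbose();

But a small network does so little work per mini-batch that the GPUs will still sit mostly idle no matter how the batch is split.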

Mike said...

Hi Davis,
you recommended the use of an SVM classifier after the DNN ("I would use http://dlib.net/ml.html#svm_c_linear_dcd_trainer").

The stated performance of 99.38% on the standard Labeled Faces in the Wild benchmark is achieved using the pure DNN with Euclidean distance measure and without any VCM behind it, correct?

Thanks, Michael

Mike said...

Sorry, I mean ...without any "SVM" behind it ...

Davis King said...

Right, the 99.38% accuracy is without any additional training applied. It's using just the DNN model by itself.

Mike said...

Hi Davis,
have you published the code used for the Labeled Faces in the Wild benchmark so that we can reproduce the result?

Davis King said...

The LFW test script is here: http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2

Sara said...

Hello Davis,

Thanks for your great work.
My dataset has 3M people with 60 images per person. Do I need to change the following numbers in the code, and do they seem reasonable to you? Are these numbers related to the batch size (which is 128 by default)?
BATCH_NUM_PERSONS = 64;
const unsigned BATCH_NUM_SAMPLES = 40

Thanks

Davis King said...

I have no idea what:

BATCH_NUM_PERSONS = 64;
const unsigned BATCH_NUM_SAMPLES = 40

is referring to in your code.

Sara said...

I am running "dnn_metric_learning_on_images_ex.cpp" file in dlib/examples.
Sorry, maybe I am using an older version. Those variables were defined in load_mini_batch class. I mean:

num_people = 64
samples_per_id = 40

Davis King said...

Those numbers are fine; usually the bigger the better, as long as your hardware has enough RAM to support such sizes. You should run experiments to see what works best, though.
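(For scale, assuming those two values multiply directly into the mini-batch size the way load_mini_batch does in the example: 64 identities × 40 images each is 2,560 images per mini-batch, which is why memory becomes the limiting factor.)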
