Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy-to-use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:

[Images: the input photos and the four automatically identified face clusters]
Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.
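To make the verification setup concrete, here is a condensed sketch of how two faces are compared, based on dlib's dnn_face_recognition_ex.cpp. This is not a complete program: anet_type is the network typedef defined in that example, and face_chip1/face_chip2 stand for the 150x150 aligned face crops produced by dlib's face detector and shape predictor.

// Load the pretrained metric network (anet_type comes from the example).
anet_type net;
deserialize("dlib_face_recognition_resnet_model_v1.dat") >> net;

// Map each aligned face chip to a 128D descriptor.
matrix<float,0,1> descr1 = net(face_chip1);
matrix<float,0,1> descr2 = net(face_chip2);

// Faces whose descriptors are closer than 0.6 in Euclidean distance are
// predicted to be the same person; this is the decision rule behind the
// LFW accuracy quoted above.
bool same_person = length(descr1 - descr2) < 0.6;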

For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of sources: the face scrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. The total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is, of course, also available, since that sort of thing is basically the point of dlib. You can find all the details on training and model specifics by reading the example program and consulting the referenced parts of dlib. There is also a Python API for accessing the face recognition model.
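To make the loss concrete, here is a minimal self-contained sketch of the per-pair hinge term described above. This is an illustration, not dlib's exact implementation (see loss_metric_ in dlib/dnn/loss.h for that); the 0.6 threshold matches the ball radius above, and the 0.04 margin is the default mentioned in the comments below.

#include <algorithm>

// Same-identity pairs are pulled inside the ball of radius dist_thresh,
// different-identity pairs are pushed outside it, each with a small margin
// so the hinge is only satisfied with room to spare. The full loss sums
// this over all pairs in a mini-batch, keeping only the hardest negatives.
double pair_hinge_loss(double dist, bool same_identity,
                       double dist_thresh = 0.6, double margin = 0.04)
{
    if (same_identity)
        return std::max(0.0, dist - (dist_thresh - margin));
    else
        return std::max(0.0, (dist_thresh + margin) - dist);
}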



[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A data-driven approach to cleaning large face datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.

Comments

Andyrey said...

Please do not SPAM here!

Unknown said...

Hi Davis,

Thank you for providing this great library!
I would like to speed up the inference step of the network on the CPU by using Intel's OpenVino framework. To do this I would first need to convert your network into a format that the OpenVino model optimizer understands (e.g. Caffe). I have tried to use your "convert_dlib_nets_to_caffe" tool, but I ran into a problem.
The conversion from .dat to .xml did work, but when I ran the tool (./convert_dlib_nets_to_caffe DlibFaceNet.xml 1 3 150 150) it gave the following error:
*************** ERROR CONVERTING TO CAFFE ***************
No conversion between dlib pooling layer parameters and caffe pooling layer parameters found for layer 127
dlib_output_nc: 35
bottom_nc: 72
padding_x: 0
stride_x: 2
kernel_w: 3
pad_x: 1

Any suggestions on how to fix this? It would be great if we could make your network run on the new Intel Movidius Neural Compute Stick 2.

Cheers

Unknown said...

I'm attempting to use dnn_metric_learning_on_images_ex on multiple (4) GPUs for the first time. After much playing with the batch size and the number of data_loader threads, I can't seem to get the typical GPU usage above ~30%. Any suggestions on what I should be looking at / modifying to keep the GPUs occupied? Thanks!

Davis King said...

The network in that example is probably too small to benefit from 4 GPUs.

Mike said...

Hi Davis,
you recommended the use of an SVM classifier after the DNN ("I would use http://dlib.net/ml.html#svm_c_linear_dcd_trainer").

The stated performance of 99.38% on the standard Labeled Faces in the Wild benchmark is achieved using the pure DNN with Euclidean distance measure and without any VCM behind it, correct?

Thanks, Michael

Mike said...

Sorry, I mean ...without any "SVM" behind it ...

Davis King said...

Right, the 99.38% accuracy is without any additional training applied. It's using just the DNN model by itself.

Mike said...

Hi Davis,
have you published the code used for the Labeled Faces in the Wild benchmark so that we can duplicate the result?

Davis King said...

The LFW test script is here: http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2

Sara said...

Hello Davis,

Thanks for your great work.
My dataset has 3M people with 60 images per person. Do I need to change the following numbers in the code, and do they seem reasonable?
Are these numbers related to the batch size (which is 128 by default)?
BATCH_NUM_PERSONS = 64;
const unsigned BATCH_NUM_SAMPLES = 40

Thanks

Davis King said...

I have no idea what:

BATCH_NUM_PERSONS = 64;
const unsigned BATCH_NUM_SAMPLES = 40

is referring to in your code.

Sara said...

I am running "dnn_metric_learning_on_images_ex.cpp" file in dlib/examples.
Sorry, maybe I am using an older version. Those variables were defined in load_mini_batch class. I mean:

num_people = 64
samples_per_id = 40

Davis King said...

Those numbers are fine, the bigger the better usually, as long as your hardware has enough RAM to support such sizes. You should run experiments to see what works best though.
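For readers wondering what those knobs control, the following is a hypothetical sketch of the sampling scheme; the real code is load_mini_batch() in dnn_metric_learning_on_images_ex.cpp, and the names and data layout here are illustrative only.

#include <cstddef>
#include <random>
#include <vector>

// objs[i] holds the images (represented here as indices) of identity i.
// Picking num_people identities and samples_per_id images of each gives a
// mini-batch containing both positive pairs (same person) and a large pool
// of negative pairs for hard-negative mining.
std::vector<int> sample_mini_batch(
    const std::vector<std::vector<int>>& objs,
    std::size_t num_people, std::size_t samples_per_id, std::mt19937& rnd)
{
    std::vector<int> batch;
    std::uniform_int_distribution<std::size_t> pick_id(0, objs.size()-1);
    for (std::size_t i = 0; i < num_people; ++i)
    {
        const auto& imgs = objs[pick_id(rnd)];
        std::uniform_int_distribution<std::size_t> pick_img(0, imgs.size()-1);
        for (std::size_t j = 0; j < samples_per_id; ++j)
            batch.push_back(imgs[pick_img(rnd)]);
    }
    return batch; // num_people * samples_per_id samples per mini-batch
}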

WilliamCorrea said...

Hello Davis.
I'm training my own face recognition model, testing different architectures and loss functions, and comparing them with pre-built models like yours. I came across your comment on the loss function << a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level >>, and I wonder what you think of it compared with the triplet loss (or more recent ones such as https://arxiv.org/pdf/1801.07698.pdf), and whether you have had time to empirically compare them. Thanks!

Davis King said...

I think it makes more sense than the triplet loss. I'm not sure how it compares to the other recent losses. I think sphereloss is sensible and basically motivated the same way. Really though you should optimize a loss that measures the performance of the model on the task you want to accomplish. The loss I used here in dlib optimizes the binary classification accuracy when using Euclidean distance between vectors to decide if pairs of faces are the same person. Other losses will be more appropriate for other tasks.

It should also be noted though that the quality and size of your training dataset is far and away the most important variable in making a good face recognition model. All your effort should be on that. Other things are micro optimizations.

KBN said...

Hi.

I did this ->
deserialize("mydatafile.dat") >> learned_pfunct_load_from_modelII;

It worked well.

I ported to a custom platform where I can only access mydatafile.dat from a memory location. So how do I deserialize from memory? I tried
char * mydatafileloadedfrommemory; // points to my memory data location

deserialize(mydatafileloadedfrommemory);

It does not work.

Any pointer please?
Thanks.

Davis King said...

You can also always call deserialize(some_object, some_input_stream). So you just have to make an input stream with the data in it. There are many ways to do this.

KBN said...

Thanks Davis.

I did the following -

std::istream in(&sbuf); // where sbuf is streambuf.

So, how do I get this 'some_object'?

deserialize(some_object, in); // <-- where and how do I define this 'some_object'?

Any pointer is much appreciated.
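For anyone hitting the same problem, here is a minimal sketch of one way to do it, assuming the buffer holds the raw bytes of a serialized dlib object; some_object is simply a variable of whatever type you originally serialized (deserialize_from_memory is a hypothetical helper, not a dlib function):

#include <dlib/serialize.h>
#include <cstddef>
#include <sstream>
#include <string>

// Wrap the in-memory bytes in an input stream, then use dlib's
// stream-based deserialize overload.
template <typename T>
void deserialize_from_memory(T& some_object, const char* buf, std::size_t len)
{
    std::istringstream in(std::string(buf, len));
    dlib::deserialize(some_object, in);
}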

WilliamCorrea said...

Hello Davis,
Thank you for your last explanation about the loss used by dlib. For the sake of completeness, I have some other questions.

Regarding the creation of the mini-batches at training time, do you specify a minimum number of images per class (person)? For instance, if the mini-batch size is 32 and I want a minimum of 4 images per person, would I have at most 8 different persons in that mini-batch?

Regarding the loss, with a ball radius of 0.6, is this roughly how you computed it (assuming import numpy as np)?

- For two embeddings of the same person:
difference = np.array(embedding_1) - np.array(embedding_2)
norm = np.linalg.norm(difference, 2)
loss = max(0, norm - 0.6)

- For two embeddings of different persons:
difference = np.array(embedding_1) - np.array(embedding_3)
norm = np.linalg.norm(difference, 2)
loss = max(0, 0.6 - norm)

Thank you very much !

Davis King said...

I don't remember the exact details. I would have to look at the documentation to answer your question. So you might as well just skip that step and look at the documentation yourself ;)

Unknown said...

Hello Davis,

First, I really want to thank you for this amazing library.

I am fine-tuning the model "dlib_face_recognition_resnet_model_v1.dat" based on your code from file "dnn_metric_learning_on_images_ex.cpp"

I just tried to fine-tune on your data in "examples/johns" and the average loss became NaN, as shown in the following picture: https://imgur.com/a/ltVycVz

Could you please give me some advice about this problem?
Thank you very much.

Unknown said...

Hello Davis,
I found your comment on another post. https://sourceforge.net/p/dclib/discussion/442518/thread/40667e18/#70c4/83a9/a26d/0db3/232d/878a/5291/894f/2fb5

Thank you so much, it's really helpful for me.

Unknown said...

Hi,
What is the best way to represent the 128D face vector (std::vector>) so that it can be stored in an SQL database?

And I wonder if it is possible to run the face ResNet on a CPU only, without GPU acceleration?

Thanks,
Ben

Unknown said...

There is a problem for me:
Error C1128: the number of sections exceeded the object file format limit: compile with /bigobj. ook1 C:\Users\M-Team\Desktop\ook1\ook1\ook1.cpp
what should i do?

Unknown said...

There is a problem for me:
The number of sections has exceeded the formatting limit of the object file

what should i do?

Anonymous said...

I've looked at the XML output of the part labels (from file examples/faces/training_with_face_landmarks.xml) and it seems to label parts with just numbers instead of names of parts:



<images>
<image file='2007_007763.jpg'>
<box top='90' left='194' width='37' height='37'>
<part name='00' x='201' y='107'/>
<part name='01' x='201' y='110'/>
<part name='02' x='201' y='113'/>
<part name='03' x='202' y='117'/>
...

I presume this means you have to be consistent with the ordering?

And what if you are trying to do a multi-nominal image matcher, for things that have different types of interior parts? Then they would have the same numbers.

(Sorry the question is dumb... I'm just starting with image recognition. This is a pretty neat library!)

Davis King said...

The labels can be anything, they don't have to be numbers. But you have to be consistent about their meaning. Like you can't just randomly shuffle the labels on each training sample, obviously.

Things with different types of parts need different models.

Mike said...

Hi Davis, is there any path from CUDA acceleration to NEON acceleration? I would like to run dlib on an ARM with NEON, lacking CUDA.

Davis King said...

The deep learning stuff in dlib is accelerated only by using a BLAS library or CUDA.

Mike said...

Hi Davis,
thanks. So BLAS and CUDA are mutually exclusive, correct?
If compiled with CUDA enabled, BLAS is not used, and vice versa?

Andyrey said...

Yes, BLAS is for CPU, not GPU.

Andyrey said...

Off-topic is also undesirable.

Anonymous said...

Hi Davis,

On the LFW website dlib is stated as having used the "Unrestricted" and "Labeled Outside Data" options. This, http://vis-www.cs.umass.edu/lfw/README.txt, as you know, is the benchmark recipe. The benchmark for the "Unrestricted" option requires a people.txt file. However, after downloading http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2 I can't find the people.txt file.
So my question is: which option did you use? Did the LFW people put dlib's result in the wrong category, or did you use a people.txt file that I failed to find? If so, please direct me to it.

Thank you for your work!

Best,
Jesper

Davis King said...

The people.txt file is provided in the LFW dataset and is part of their training data. It is not a model output and doesn't have anything to do with how a model is evaluated. It's just something you can use to make training data from the LFW dataset if you are going to train a model from only the LFW data.

Unknown said...

May I use it for commercial purposes?

Neeraj Kerkar said...

Has anyone evaluated this model on the IJB-A dataset face verification task?

Mike said...

Hello Davis,
I was wondering if anybody has tried porting your trained net to bolt for fast execution on mobile devices?
https://github.com/huawei-noah/bolt
Thanks.

Mike said...

Hi Davis, I learned that OpenCV now supports Tengine for inference on ARM. Wouldn't that be a nice feature for dlib, as we see more and more ARM devices coming and CUDA seems to be maxed out.

Davis King said...

Yeah that would be cool. Someone should do that and submit a pull request :)

Mahesh Bisht said...

Sir, I didn't find any documentation. Please help me.

Unknown said...

Can I use this dlib library in Japan?

Unknown said...

I am making a face matching app, but does it not work in Japan?

Andyrey said...

Since the training was done by Davis on general datasets, I suppose there is a bias toward European (Caucasian) faces, so let me make the question more specific:
- if I want to distinguish among a special group of people, such as Asians, Black people, or children, is there a tool for fine-tuning the original dlib model? This is a very relevant question for many.

Mike said...

Hi Davis,
do you have any plans to support "blasfeo" instead of OpenBLAS?

Davis King said...

No current plans to expand to other BLAS like APIs. Note however that dlib will already work with any BLAS and LAPACK libraries that provide the standard CBLAS API and typical LAPACK link symbols.

Mike said...

I have profiled the dlib facial recognition example with perf. There are 3 hot spots:

49.03% A20_Face A20_Face [.] sgemm_kernel_L4_M4_22
12.75% A20_Face A20_Face [.] sgemm_ncopy_L4_M4_20
11.96% A20_Face A20_Face [.] dlib::cpu::img2col

I believe the first 2 functions are covered by OpenBLAS; the third is a dlib function.
Any recommendation regarding speedup on ARM CPU?

Davis King said...

There isn't any special stuff for ARM in dlib's DNN tooling. The DNN tooling is really targeted at platforms that have cuDNN. Everything else will be fairly slow.

Victor said...

Hi Davis, first I want to say thanks so much for your dlib library; it has helped me a lot with my scripts. I also want to ask you about http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2. I'm trying to run the test following your read.txt, but when I run ./main I don't get the rest of the results; it only prints dist thresh: 0.6 and margin: 0.04
and then nothing else, as if the code just stops running... Please help me finish this! Thanks a lot.

anszom said...

I've looked at the loss_metric code and I wonder why the loss function is defined so "softly", so to speak. Any distance less than 0.6 results in zero loss for members of the same class, and likewise any distance above 0.6 for members of different classes. There is no incentive for the network to make a more compact representation for each class, so even if the network is trained "perfectly", random variations can push the result over the threshold.
Would it be a good idea to have a separate distance threshold for positive cases (for example 0.3), lower than the distance threshold for negative cases?

Davis King said...

It's just a classic hinge loss. https://en.wikipedia.org/wiki/Hinge_loss. Except with 0.6 instead of 1.0. The value there is smaller because it worked better with typical SGD and weight decay settings, which have a hard time making larger network outputs.

Random variations can't push it over a threshold, because the margin in the loss is not 0. You aren't reading the loss function right: if I'm understanding you correctly, you are saying the margin looks like it's 0. It's not 0. This is a classic max-margin style loss.

Davis King said...

Er, I forgot what the values are. The default margin is 0.04 instead of 1.0. But that numerical detail doesn't matter. It's a max margin loss for any number > 0 in that part of the equation.

june said...

Hello, is there a paper reporting dlib's ROC curve on the LFW benchmark and its prediction times, as in this paper, for comparison with other techniques? https://www.cs.cmu.edu/~satya/docdir/CMU-CS-16-118.pdf

Thank you

Davis King said...

No, but the test can be rerun using http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2, reproducing the dlib results noted here.

june said...

Hi Davis, okay thank you for your quick response, thanks again for your great work in dlib.

Unknown said...

Hello, if I want to add more faces to the library, would it be necessary to retrain it? Do you think this library would be accurate enough to compare faces from a security camera frame with, let's say, a prison database picture? Or should I do some pre-treatment on the frame to enhance the image quality? Thanks in advance.

Andyrey said...

@Unknown: the beauty of this method is that you don't need to retrain the model for new faces. It is trained only to compare two faces and answer whether they are the same person or not. But some ethnic bias is possible.
Pre-treatment is necessary: the face should be large enough and not overly smoothed. It is better if you can use some statistics to get more reliable recognition.

Unknown said...

@Andyrey Thank you for your Answer!

Achmad Rifki said...

Hello Davis, thanks for your works, such a great library.

So, I came across your comment on the model details: << For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half. >>

I want to ask: which layers did you remove to go from 34 conv layers to 29? Would you mind showing us the details of which layers were removed? Thank you

Achmad Rifki said...

I think the details of the 29 conv layers are in the file dnn_metric_learning_on_images_ex.cpp, right?

Davis King said...

Yep, that example program has the details.
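For readers counting along: as best I can reconstruct from that example's network definition (the example file is authoritative), the net stacks one initial 7x7 conv followed by residual levels of 1, 3, 3, 4, and 3 blocks, each block containing two 3x3 convs, so the total is 1 + 2 x (1 + 3 + 3 + 4 + 3) = 29 conv layers.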

Linko994 said...

I am concerned about the sentence "the pretrained model used by this example program is in the public domain. So you can use it for anything you want."

You used VGGFace and Face Scrub to train the model. And at least face scrub is definitely under a non-commercial license. So how can the resulting model be "used for everything you want"?

Davis King said...

I am not making any claim of rights to the work I did here. As for what others might claim about datasets, I have no idea. You should talk to a lawyer if you have concerns.
