Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy-to-use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:

[Images: the input photos and the four automatically identified face clusters]

Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.

For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun, with a few layers removed and the number of filters per layer reduced by half.
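
For the curious, here is the network definition in dlib's DNN API, essentially as it appears in dnn_face_recognition_ex.cpp (the copy shipped with dlib is the authoritative one):

```cpp
#include <dlib/dnn.h>
using namespace dlib;

// Residual blocks, built from two 3x3 conv layers each.
template <template <int,template<typename>class,int,typename> class block,
          int N, template<typename>class BN, typename SUBNET>
using residual = add_prev1<block<N,BN,1,tag1<SUBNET>>>;

template <template <int,template<typename>class,int,typename> class block,
          int N, template<typename>class BN, typename SUBNET>
using residual_down = add_prev2<avg_pool<2,2,2,2,skip1<tag2<block<N,BN,2,tag1<SUBNET>>>>>>;

template <int N, template <typename> class BN, int stride, typename SUBNET>
using block = BN<con<N,3,3,1,1,relu<BN<con<N,3,3,stride,stride,SUBNET>>>>>;

// "affine" rather than batch norm, since this is the inference-time network.
template <int N, typename SUBNET> using ares      = relu<residual<block,N,affine,SUBNET>>;
template <int N, typename SUBNET> using ares_down = relu<residual_down<block,N,affine,SUBNET>>;

template <typename SUBNET> using alevel0 = ares_down<256,SUBNET>;
template <typename SUBNET> using alevel1 = ares<256,ares<256,ares_down<256,SUBNET>>>;
template <typename SUBNET> using alevel2 = ares<128,ares<128,ares_down<128,SUBNET>>>;
template <typename SUBNET> using alevel3 = ares<64,ares<64,ares<64,ares_down<64,SUBNET>>>>;
template <typename SUBNET> using alevel4 = ares<32,ares<32,ares<32,SUBNET>>>;

// The whole network: 150x150 RGB image in, 128D face descriptor out.
using anet_type = loss_metric<fc_no_bias<128,avg_pool_everything<
                            alevel0<
                            alevel1<
                            alevel2<
                            alevel3<
                            alevel4<
                            max_pool<3,3,2,2,relu<affine<con<32,7,7,2,2,
                            input_rgb_image_sized<150>
                            >>>>>>>>>>>>;
```

Counting the two conv layers in each residual block plus the initial 7x7 conv gives the 29 conv layers mentioned above.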

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of datasets: the face scrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. The total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pairwise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib. There is also a Python API for accessing the face recognition model.
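
Concretely, the idea is roughly this: with distance threshold τ = 0.6 and a small margin m, a same-identity pair (i, j) of embeddings is penalized by max(0, ||xi − xj|| − (τ − m)) and a different-identity pair by max(0, (τ + m) − ||xi − xj||), with only the hardest non-matching pairs in each mini-batch contributing. See the loss_metric_ documentation for the exact definition.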



[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A Data-Driven Approach to Cleaning Large Face Datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.

53 comments:

Mohamed Ikbel Boulabiar said...

Can it detect if someone is not in the database?
Detecting unknown people is a problem in another library, which has no way to say whether a face is absent from the labeled-faces database.

Davis King said...

Yes. At the end of the day, this is a classifier that tells you if two images are of the same person. Half its job is to say "no" when they aren't.
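
In code, that decision is just a Euclidean distance threshold on the two 128D descriptors. A minimal sketch (same_person is a hypothetical helper; 0.6 is the threshold the network was trained around):

```cpp
#include <dlib/matrix.h>

// Decide whether two 128D face descriptors belong to the same person.
bool same_person(const dlib::matrix<float,0,1>& a,
                 const dlib::matrix<float,0,1>& b)
{
    // Identities were trained to live in balls of radius 0.6, so faces
    // closer than 0.6 in Euclidean distance are called the same person.
    return dlib::length(a - b) < 0.6;
}
```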

Kyle McDonald said...

Could you say a little more about what "graph clustering methods" you used here? I'm interested in using this on a dataset to cluster unknown identities. Right now I have a few ideas: 1) just do k-means, 2) do the n^2 comparisons, then do k-means on those rows, 3) take each face and compare it to the n-1 others, assign it to the best match, and then at the end group all the faces that are part of the same set (I don't know if there's a name for #2 or #3...)

Davis King said...

The one you probably want to use is the one in the example program, the "Chinese Whispers" algorithm. The paper describing the method is referenced in the dlib documentation. It's a really simple iterative graph neighbor relabeling algorithm that gives surprisingly good results. It's what made the 4 clusters in this example. You don't even tell it how many clusters there are.
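
For reference, the clustering step boils down to something like this sketch (along the lines of the example program: an edge between every pair of descriptors closer than 0.6, then chinese_whispers assigns the labels):

```cpp
#include <dlib/clustering.h>
#include <dlib/matrix.h>
#include <vector>

// Cluster 128D face descriptors with the Chinese Whispers algorithm:
// connect every pair of faces closer than 0.6 and let the iterative
// graph relabeling sort out the identities.
std::vector<unsigned long> cluster_faces(
    const std::vector<dlib::matrix<float,0,1>>& face_descriptors)
{
    std::vector<dlib::sample_pair> edges;
    for (size_t i = 0; i < face_descriptors.size(); ++i)
        for (size_t j = i; j < face_descriptors.size(); ++j)  // self-edges keep lone faces in the graph
            if (dlib::length(face_descriptors[i] - face_descriptors[j]) < 0.6)
                edges.push_back(dlib::sample_pair(i, j));

    std::vector<unsigned long> labels;
    dlib::chinese_whispers(edges, labels);  // returns the number of clusters
    return labels;  // labels[i] is the cluster id of face i
}
```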

There are also graph clustering methods like modularity clustering, which is also in dlib, but I've found on many problems that a simple method like Chinese whispers gives better results, which is surprising considering how theoretically well motivated modularity clustering is.

As for what else I did to clean up the data: I would sort pairs of identities by the average similarity between their images. That helped find cases where the same person appeared under two names. Then I would also sort all the images for a given person by how close they were to the centroid of their class. If you then look at that sorted list, you can see obvious labeling errors accumulate at the end and remove them. There were a bunch of other minor variations on that kind of theme, with a bunch of manual review. A LOT of manual review.
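
As an illustrative sketch of that centroid trick (this is just the idea, not my actual cleaning code):

```cpp
#include <dlib/matrix.h>
#include <algorithm>
#include <vector>

// Sort one identity's face descriptors by distance to the class centroid.
// Labeling errors tend to pile up at the end of the returned ordering.
// Assumes descs is non-empty.
std::vector<size_t> sort_by_centroid_distance(
    const std::vector<dlib::matrix<float,0,1>>& descs)
{
    // Compute the mean descriptor for this identity.
    dlib::matrix<float,0,1> centroid = descs[0];
    for (size_t i = 1; i < descs.size(); ++i)
        centroid += descs[i];
    centroid /= (float)descs.size();

    // Order image indices from closest to farthest from the centroid.
    std::vector<size_t> order(descs.size());
    for (size_t i = 0; i < order.size(); ++i)
        order[i] = i;
    std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
        return dlib::length(descs[a] - centroid) < dlib::length(descs[b] - centroid);
    });
    return order;
}
```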

Kyle McDonald said...

Thanks! I just looked into the Chinese whispers algorithm. It feels like a graphical version of the k-medoids algorithm, except you're changing the assignments of each item instead of changing the medoid assignment. It makes sense to me that it would converge on something useful if the initialization is good, but I would expect it to have similar problems as k-means where bad initialization can cause degenerate assignments. I'll run it a few times and look for the best results :)

Davis King said...

You will be surprised. It's very good considering it's a really simple method. I'm still slightly mystified that it's better than modularity clustering, but that's always been my experience.

I've also found that the random initialization is irrelevant. It always seems to converge to something pretty sensible. The only thing I can say that's bad, aside from the name being maybe slightly racist, is that sometimes I've found it useful to do some kind of post processing to clean up the results, e.g. looking at clusters and checking if any of them have a lot of edges between them and merging them after the fact. But usually it's pretty good.

ngap wei Tham said...

The comments of the C++ example mention

"This model has a 99.38% accuracy on the standard LFW face recognition benchmark, which is comparable to other state-of-the-art methods for face recognition as of February 2017."

But this post said

"given two face images, it correctly predicts if the images are of the same person 99.38% of the time."

That sounds more like verification (is A equal to B?) rather than recognition (who is A?). Is the 99.38% accuracy for verification or recognition?

Davis King said...

It's 99.38% according to the LFW evaluation protocol. Complain to the LFW people about the choice of words if you don't like it.

钟华平 said...

I used the code in python_examples/face_recognition.py to get descriptors for two given face images and then calculate the cosine similarity between these two 128D descriptors so as to verify whether these two face images are from the same person. However, I found that although the input images are not from the same person, the similarity will be very high (greater than 0.9). Actually, I used the images from LFW to verify the code.

Davis King said...

As the example says, use Euclidean distance, not cosine similarity.

florisdesmedt said...

Another great extension of the dlib library! Is there a reason the CPU HOG-based frontal face detector is used instead of the (more accurate) DNN version (other than training a model for only frontal faces)?

Best regards

Davis King said...

Thanks. No reason other than the HOG detector is faster.

ngap wei Tham said...

>The network was trained from scratch on a dataset of about 3 million faces

Thanks for the model and nice example.
Is it possible to make the dataset publicly available?

Davis King said...

I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

gaurav gupta said...

How does it compare to betaface?
https://www.betaface.com/wpa/

Davis King said...

I have no idea, do they post their accuracy on the LFW benchmark? I posted my LFW accuracy, so you can use that to compare against other tools.

Davis King said...

Turns out betaface has their accuracy listed on the LFW results page (http://vis-www.cs.umass.edu/lfw/results.html). It's only 98.08% apparently.

gaurav gupta said...

I tried using dlib face detection on a somewhat blurred image. It couldn't find any faces. But betaface detected the face in the same image. Is there any preprocessing required?

Davis King said...

Maybe the face is too small and you need to make the image bigger, I don't know.

Davis King said...

You could also always try this detector (http://blog.dlib.net/2016/10/easily-create-high-quality-object.html) instead of the one used in the face recognition example program.

richardliao said...

I have tried to use dlib to detect anime faces, but it only works less than 50% of the time. Is there any way I can tweak the code to do so without going through manual labeling and retraining models? Thanks!

Davis King said...

I doubt it. I would train a detector. It's pretty easy to do.

Kasper van Zon said...

I would like to play around with this face recognition network in combination with the OpenCV VideoCapture. The images from OpenCV (dlib::cv_image) are however in BGR pixel format, and I am assuming that the face network is trained with RGB images. Would it make a big difference if I fed the network BGR images? Or does dlib have an efficient routine to convert from BGR to RGB?

Davis King said...

The images need to be RGB. If you are using C++ pretty much any way to convert the image is fine. I don't know what's a sensible method in Python.
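
For example, with OpenCV available you can do something like this (a sketch; to_dlib_rgb is just an illustrative helper):

```cpp
#include <dlib/opencv.h>
#include <dlib/image_transforms.h>
#include <opencv2/imgproc.hpp>

// Convert an OpenCV BGR frame into a dlib RGB image the network can take.
dlib::matrix<dlib::rgb_pixel> to_dlib_rgb(const cv::Mat& bgr)
{
    cv::Mat rgb;
    cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);  // swap the channel order

    // cv_image is a non-owning view; copy it into a dlib matrix.
    dlib::matrix<dlib::rgb_pixel> img;
    dlib::assign_image(img, dlib::cv_image<dlib::rgb_pixel>(rgb));
    return img;
}
```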

Kasper van Zon said...

Thank you for the information, and for making your awesome library publicly available!

Converting the images in C++ is indeed relatively easy. I was just checking whether there wasn't already something like a SIMD-optimized pixel conversion routine in dlib.

Davis King said...

No problem :)

You can also make a new input layer that reads directly from an OpenCV image if you feel the need. It's easy to do since the input layer interface you have to implement is fully documented: http://dlib.net/dlib/dnn/input_abstract.h.html#EXAMPLE_INPUT_LAYER

Daniel Sáez said...

Do you have any reference for the structured metric loss that you used? Thanks!

Davis King said...

The loss is described in the loss_metric_ documentation. However, I don't have a reference paper for it.

Anirud Thyagharajan said...

Adding to what Mohamed Ikbel asked, wouldn't the network need to be trained again for the task of face verification on identities that were not present in the dataset?

This is a brilliant piece of code, giving the power to change metric functions as well. Kudos for that.

I'm also interested in how to approach fine-tuning the pretrained net; are there any APIs for that? Thanks!

Yatong Zhang said...

Can you share the images you trained the model on?

Davis King said...

No, it's not required to retrain. The model posted wasn't trained on any of the faces/identities in LFW for example. The whole point of this type of model is that you don't need to do that kind of target specific training, which is why metric learning style algorithms are so popular for face recognition and verification right now. That's not to say that you don't, as a post processing step, combine some kind of target specific SVM or something that operates on top of the metric learning algorithm. People sometimes do that and it can improve verification. But you can also just do k-nearest-neighbors as your verification algorithm and that is pretty good too. Many things are possible. But in any case, no, you don't retrain the metric learning part.
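
As an illustrative sketch of that k-nearest-neighbor idea (identify and Enrolled are made-up names for this example, not a dlib API):

```cpp
#include <dlib/matrix.h>
#include <string>
#include <vector>

// One enrolled face: a 128D descriptor plus the person's name.
struct Enrolled
{
    dlib::matrix<float,0,1> desc;
    std::string name;
};

// Nearest-neighbor verification: return the closest enrolled identity,
// or "unknown" if even the nearest descriptor is farther than 0.6.
std::string identify(const std::vector<Enrolled>& gallery,
                     const dlib::matrix<float,0,1>& probe)
{
    std::string best = "unknown";
    double best_dist = 0.6;  // the same threshold the metric was trained around
    for (const auto& e : gallery)
    {
        const double d = dlib::length(e.desc - probe);
        if (d < best_dist)
        {
            best_dist = d;
            best = e.name;
        }
    }
    return best;
}
```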

Although, if you want to retrain or fine tune or do anything like that the API is fully documented. There are introduction examples to the DNN API as well as a full API reference. http://dlib.net/faq.html#Whereisthedocumentationforobjectfunction

As for training data, as I said before: I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

Anirud Thyagharajan said...

Ah, I see. Thank you so much for your comprehensive reply. I will try it out for other image sets.

I tried it for the example file given in faces/2007_007763.jpg in the examples folder of the dlib Github repository, but the clustering didn't quite turn out correct. Is there any kind of preprocessing required for this to work out? Also, is there any necessity for more images of the same identity to be present for the clustering to work?

Davis King said...

Nothing is perfect. The examples are what they are. What is best for any application depends on the details and computer vision and machine learning is complex. I can always find some additional thing to do or change to some standard technique that makes something more or less applicable to any given problem.

Anirud Thyagharajan said...

Very true, this could be a specific outlier.

Thank you so much for your time and effort in replying to me, very much appreciated, and a great tool it is indeed!

Davis King said...

No problem :)

Christian Otto said...

I'm wondering if I did something wrong when compiling the dnn_face_recognition_ex.cpp since it appears to be very slow (it runs about 7 mins). Does it make use of the GPU? Do I have to enable something for it to do so?

Davis King said...

It will use CUDA and cuDNN if you have it installed. Also, are you using visual studio? http://dlib.net/faq.html#Whyisdlibslow

Christian Otto said...

No, I'm on Ubuntu and just built with CMake. Does the cuDNN version matter? And does it use CUDA if it is not installed in the standard location? Thanks!

Davis King said...

The CMake output tells you what is happening. There are big obvious messages that say things about CUDA and cuDNN, telling you what it's doing.

Christian Otto said...

Oh, I never actually used the provided CMakeLists file. It told me that I was using cuDNN version 4, which was wrong. It's about 10 times faster now, thank you Davis.

ARBaboon said...

Interestingly, it seems dlib_face_recognition_resnet_model_v1 has poor dynamic range for 25-to-40-year-old African-Americans, tested with a dataset containing 200 people.

Davis King said...

Yeah, there is definitely some dataset bias. The training data I have, along with LFW, is definitely biased towards white guys in the sense that they are overrepresented in the data. I spent a while trying to gather non-white people for the training dataset to improve it but it's still somewhat biased.

Leugim said...

Interesting. I was going to try to improve OpenFace with a dataset I recently crawled from the web.

Can I ask why you don't augment your data via random colour channel shifts? Unless I'm mistaken, I can't see you doing that.

Also, why have you decided to provide this with a Python interface but not your DNN face detector?

mphielipp said...

Awesome new functionality!! Thank you Davis!

Any suggestions if I want to create a DLL to use this in C#?

Davis King said...

The training data was augmented with random color shifts.

I made a Python interface because people asked for it. I didn't make one for other things because fewer people asked / I'm busy / I don't feel like it.


In general my advice for calling C++ from C# (or java) is to use SWIG, which I've found to be very convenient.

bubi said...

Davis, thank you very much for this great work. Just a simple but intriguing question: have you used people of different genders for hard-negative mining at the mini-batch level? Meaning a = female, p = female, n = male, or vice versa?

Davis King said...

Each mini-batch includes a mix of genders. So yes.

Luke said...

This seems so neat! Could a Python example and API be coming in the near future?

Davis King said...

There is a Python example; it's discussed in this blog post.

Nithish Chauhan said...

Sir, you are awesome, and the dlib library too. I really like your dlib library; it has helped me a lot.

I am working on an image & video analytics team as a researcher at a company. I have around 2 years of programming experience in C++. Sir, how do I start writing code such as dlib? I really like your C++ code; it does wonders. I sometimes find it difficult to write reusable classes in C++. I really need your guidance on where to start and how to improve my code on a daily basis. Thanks in advance.

Davis King said...

You should study by reading books: http://dlib.net/books.html. That is the best way to get started. Anyone who tells you otherwise is leading you astray.