Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy-to-use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:




Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.
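
To make that concrete, here is a minimal sketch of the verification test using dlib's Python API. It's only a sketch: the image file names are placeholders, and it assumes you've downloaded the pretrained landmark and recognition models distributed with dlib.

```python
# Minimal verification sketch using dlib's Python API.  The image file
# names are placeholders; the two model files are the ones distributed
# with dlib.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
sp = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_descriptor(path):
    """Return the 128D descriptor of the first face found in an image."""
    img = dlib.load_rgb_image(path)
    dets = detector(img, 1)  # upsample once so smaller faces are found
    shape = sp(img, dets[0])
    return np.array(facerec.compute_face_descriptor(img, shape))

a = face_descriptor("image_a.jpg")
b = face_descriptor("image_b.jpg")

# Faces whose descriptors are closer than 0.6 (Euclidean distance) are
# predicted to be the same person.
print("same person" if np.linalg.norm(a - b) < 0.6 else "different people")
```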

For those interested in the model details, this model is a ResNet with 27 convolutional layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun, with a few layers removed and the number of filters per layer reduced by half.

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of sources: the face scrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I did my best to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. The total number of individual identities in the dataset is 7485, and I made sure to avoid overlap with the identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib.  There is also a Python API for accessing the face recognition model.
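
To give a rough idea of what that loss computes, here is a NumPy sketch. The margin value and the exact hard-negative selection rule shown are illustrative simplifications rather than a line-for-line copy of dlib's loss_metric:

```python
# Rough NumPy sketch of the kind of pair-wise hinge loss described above.
# The margin and the hard-negative selection rule are illustrative
# assumptions; see dlib's loss_metric for the real thing.
import numpy as np

def pairwise_hinge_loss(embeddings, labels, threshold=0.6, margin=0.04):
    # All pairwise Euclidean distances within the mini-batch.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices(len(labels), k=1)   # each pair once, no self-pairs
    d, same = dists[iu], same[iu]

    # Matching pairs are pulled inside the ball of radius 0.6, and
    # non-matching pairs are pushed outside it.
    pos = np.maximum(0, d[same] - (threshold - margin))
    neg = np.maximum(0, (threshold + margin) - d[~same])

    # Hard-negative mining: keep only the hardest negative pairs, as many
    # as there are positive pairs in the mini-batch.
    neg = np.sort(neg)[::-1][:len(pos)]
    return (pos.sum() + neg.sum()) / max(len(pos) + len(neg), 1)
```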



[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A Data-Driven Approach to Cleaning Large Face Datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.

23 comments:

Mohamed Ikbel Boulabiar said...

Can it detect if someone is not in the database?
Detecting unknown people is a problem in another library, which has no way to say that a face is not in the labeled faces database.

Davis King said...

Yes. At the end of the day, this is a classifier that tells you if two images are of the same person. Half its job is to say "no" when they aren't.

Kyle McDonald said...

Could you say a little more about what "graph clustering methods" you used here? I'm interested in using this on a dataset to cluster unknown identities. Right now I have a few ideas: 1) just do k-means, 2) do the n^2 comparisons, then do k-means on those rows, 3) take each face and compare it to the n-1 others, assign it to the best match, and then at the end group all the faces that are part of the same set (I don't know if there's a name for #2 or #3...)

Davis King said...

The one you probably want to use is the one in the example program, the "Chinese Whispers" algorithm. The paper describing the method is referenced in the dlib documentation. It's a really simple iterative graph neighbor relabeling algorithm that gives surprisingly good results. It's what made the 4 clusters in this example. You don't even tell it how many clusters there are.
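
With dlib's Python binding, the clustering step looks roughly like this (a sketch; the helper is illustrative and the 0.5 edge threshold follows dlib's face clustering example):

```python
# Clustering face descriptors with dlib's Chinese whispers binding.
# cluster_faces is an illustrative helper; the 0.5 edge threshold follows
# dlib's face clustering example.
import dlib

def cluster_faces(image_paths, detector, sp, facerec):
    descriptors, faces = [], []
    for path in image_paths:
        img = dlib.load_rgb_image(path)
        for det in detector(img, 1):
            shape = sp(img, det)
            descriptors.append(facerec.compute_face_descriptor(img, shape))
            faces.append((path, det))

    # Edges connect descriptors closer than the threshold; nodes are then
    # iteratively relabeled until the cluster labels stabilize.  Note that
    # the number of clusters is never specified anywhere.
    labels = dlib.chinese_whispers_clustering(descriptors, 0.5)

    clusters = {}
    for face, label in zip(faces, labels):
        clusters.setdefault(label, []).append(face)
    return clusters
```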

There are also graph clustering methods like modularity clustering, which is also in dlib, but I've found on many problems that a simple method like Chinese whispers gives better results, which is surprising considering how theoretically well motivated modularity clustering is.

As for what else I did to clean up the data: I would sort pairs of identities by the average similarity between their images. That helped find cases where the same person appeared under two names. I would also sort all the images for a given person by how close they were to the centroid of their class. If you then look at that sorted list, obvious labeling errors accumulate at the end, where they are easy to spot and remove. There were a bunch of other minor variations on that kind of theme, along with a bunch of manual review. A LOT of manual review.
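
In NumPy terms, the centroid trick looks roughly like this (a sketch with illustrative array names):

```python
# Sketch of the centroid-sorting trick.  The names are illustrative:
# descriptors is an (N, 128) array and identities a length-N label array.
import numpy as np

def rank_by_centroid_distance(descriptors, identities, identity):
    idx = np.flatnonzero(identities == identity)
    centroid = descriptors[idx].mean(axis=0)
    dists = np.linalg.norm(descriptors[idx] - centroid, axis=1)
    # Images farthest from their class centroid sort to the end, which is
    # where the labeling errors tend to accumulate.
    return idx[np.argsort(dists)]
```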

Kyle McDonald said...

Thanks! I just looked into the Chinese whispers algorithm. It feels like a graphical version of the k-medoids algorithm, except you're changing the assignments of each item instead of changing the medoid assignment. It makes sense to me that it would converge on something useful if the initialization is good, but I would expect it to have similar problems as k-means where bad initialization can cause degenerate assignments. I'll run it a few times and look for the best results :)

Davis King said...

You will be surprised. It's very good considering it's a really simple method. I'm still slightly mystified that it's better than modularity clustering, but that's always been my experience.

I've also found that the random initialization is irrelevant. It always seems to converge to something pretty sensible. The only thing I can say that's bad, aside from the name being maybe slightly racist, is that sometimes I've found it useful to do some kind of post-processing to clean up the results, e.g. looking at clusters, checking if any of them have a lot of edges between them, and merging those after the fact. But usually it's pretty good.

ngap wei Tham said...

The comments of the C++ example mention

"This model has a 99.38% accuracy on the standard LFW face recognition benchmark, which is comparable to other state-of-the-art methods for face recognition as of February 2017."

But this post said

"given two face images, it correctly predicts if the images are of the same person 99.38% of the time."

It sounds more like verification (is A equal to B?) rather than recognition (who is A?). Is the 99.38% accuracy for verification or recognition?

Davis King said...

It's 99.38% according to the LFW evaluation protocol. Complain to the LFW people about the choice of words if you don't like it.

钟华平 said...

I used the code in python_examples/face_recognition.py to get descriptors for two face images and then calculated the cosine similarity between the two 128D descriptors to verify whether the two images are of the same person. However, I found that even when the input images are not of the same person, the similarity is very high (greater than 0.9). I used images from LFW to verify the code.

钟华平 said...
This comment has been removed by the author.
Davis King said...

As the example says, use Euclidean distance, not cosine similarity.
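
That is, something like this (a sketch; a and b are the 128D descriptors as NumPy arrays):

```python
import numpy as np

def same_person(a, b):
    """a and b are 128D face descriptors as NumPy arrays."""
    # The model was trained so descriptors of the same person end up
    # within Euclidean distance 0.6 of each other; cosine similarity
    # does not respect that geometry.
    return np.linalg.norm(a - b) < 0.6
```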

florisdesmedt said...

Another great extension of the dlib library! Is there a reason the CPU HOG-based frontal face detector is used instead of the (more accurate) DNN version (other than the model being trained only on frontal faces)?

Best regards

Davis King said...

Thanks. No reason other than the HOG detector is faster.

ngap wei Tham said...

>The network was trained from scratch on a dataset of about 3 million faces

Thanks for the model and nice example.
Is it possible to make the dataset publicly available?

Davis King said...

I'm probably not going to post the data since it's a big dataset and I don't want to deal with hosting it. Also, Microsoft's MS-Celeb-1M dataset is out now, and it's bigger than mine anyway. So you might as well get that dataset instead.

gaurav gupta said...

How does it compare to betaface?
https://www.betaface.com/wpa/

Davis King said...

I have no idea, do they post their accuracy on the LFW benchmark? I posted my LFW accuracy, so you can use that to compare against other tools.

Davis King said...

Turns out betaface has their accuracy listed on the LFW results page (http://vis-www.cs.umass.edu/lfw/results.html). It's only 98.08% apparently.

gaurav gupta said...

I tried using dlib face detection on a slightly blurred image and couldn't find any faces. But betaface detected the face in the same image. Is there any preprocessing required?

Davis King said...

Maybe the face is too small and you need to make the image bigger, I don't know.
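
For example, the detector's second argument controls how many times the image is upsampled before detection (the file name is a placeholder):

```python
import dlib

detector = dlib.get_frontal_face_detector()
img = dlib.load_rgb_image("blurry.jpg")  # placeholder file name

# The second argument upsamples the image that many times before
# detection, which helps find faces that are too small at native size.
dets = detector(img, 2)
print(len(dets), "faces found")
```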

Davis King said...

You could also always try this detector (http://blog.dlib.net/2016/10/easily-create-high-quality-object.html) instead of the one used in the face recognition example program.

richardliao said...

I have tried to use dlib to detect anime faces, but it only works less than 50% of the time. Is there any way I can tweak the code to do so without going through manual labeling and retraining models? Thanks!

Davis King said...

I doubt it. I would train a detector. It's pretty easy to do.