Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy-to-use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:




Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.

For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of sources: the FaceScrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and FaceScrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib.  There is also a Python API for accessing the face recognition model.
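As a rough sketch of how the pieces fit together from Python (the model file names are the ones shipped with dlib, the image paths are placeholders, and the 0.6 threshold is the one discussed above):

import dlib
import numpy as np
from skimage import io

detector = dlib.get_frontal_face_detector()
sp = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def descriptor(path):
    # Detect the face, find its landmarks (used for alignment), then
    # compute the 128D descriptor for it.
    img = io.imread(path)
    det = detector(img, 1)[0]
    shape = sp(img, det)
    return np.array(facerec.compute_face_descriptor(img, shape))

d1 = descriptor("face1.jpg")   # placeholder image paths
d2 = descriptor("face2.jpg")
# Faces whose descriptors are less than 0.6 apart are judged to be the same person.
print("same person" if np.linalg.norm(d1 - d2) < 0.6 else "different people")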



[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A Data-Driven Approach to Cleaning Large Face Datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.

150 comments:

Mohamed Ikbel Boulabiar said...

Can it detect if someone is not in the database?
Detecting unknown people is a problem in another library I've used, which has no way to say that a face is not in the labeled faces database.

Davis King said...

Yes. At the end of the day, this is a classifier that tells you if two images are of the same person. Half its job is to say "no" when they aren't.

Kyle McDonald said...

Could you say a little more about what "graph clustering methods" you used here? I'm interested in using this on a dataset to cluster unknown identities. Right now I have a few ideas: 1.) just do k-means, 2.) do the n^2 comparisons, then do k-means on those rows, 3.) take each face and compare it to the n-1 others, assign it to the best match, and then at the end group all the faces that are part of the same set (don't know if there's a name for #2 or #3...)

Davis King said...

The one you probably want to use is the one in the example program, the "Chinese Whispers" algorithm. The paper describing the method is referenced in the dlib documentation. It's a really simple iterative graph neighbor relabeling algorithm that gives surprisingly good results. It's what made the 4 clusters in this example. You don't even tell it how many clusters there are.

There are also graph clustering methods like modularity clustering, which is also in dlib, but I've found on many problems that a simple method like Chinese whispers gives better results. Which is surprising considering how theoretically well motivated modularity clustering is.

As for what else I did to clean up the data: I would sort pairs of identities by their average similarity. That helped find cases where the same person appeared under two names. Then I would also sort all the images for a given person by how close they were to the centroid of their class. If you then look at that sorted list you can see obvious labeling errors accumulate at the end and remove them. There were a bunch of other minor variations on that kind of theme, with a bunch of manual review. A LOT of manual review.
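For reference, the Chinese Whispers clustering step mentioned above is also callable from Python. A minimal sketch (this assumes a dlib build that exposes dlib.chinese_whispers_clustering; descriptors is a list of 128D face descriptors built as in the example programs):

import dlib

# descriptors: list of 128D face descriptors, one per detected face, e.g.
# descriptors = [facerec.compute_face_descriptor(img, shape) for ...]
# Faces closer than the threshold are linked in a graph, and Chinese
# Whispers iteratively relabels nodes; the number of clusters falls out
# of the process rather than being specified up front.
labels = dlib.chinese_whispers_clustering(descriptors, 0.5)
print("found", len(set(labels)), "clusters")
for i, label in enumerate(labels):
    print("face", i, "belongs to cluster", label)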

Kyle McDonald said...

Thanks! I just looked into the Chinese whispers algorithm. It feels like a graphical version of the k-medoids algorithm, except you're changing the assignments of each item instead of changing the medoid assignment. It makes sense to me that it would converge on something useful if the initialization is good, but I would expect it to have similar problems as k-means where bad initialization can cause degenerate assignments. I'll run it a few times and look for the best results :)

Davis King said...

You will be surprised. It's very good considering it's a really simple method. I'm still slightly mystified that it's better than modularity clustering, but that's always been my experience.

I've also found that the random initialization is irrelevant. It always seems to converge to something pretty sensible. The only thing I can say that's bad, aside from the name being maybe slightly racist, is that sometimes I've found it useful to do some kind of post processing to clean up the results. e.g. looking at clusters and checking if any of them have a lot of edges between them and merging them after the fact. But usually it's pretty good.

ngap wei Tham said...

The comments of the C++ example mention:

"This model has a 99.38% accuracy on the standard LFW face recognition benchmark, which is comparable to other state-of-the-art methods for face recognition as of February 2017."

But this post said

"given two face images, it correctly predicts if the images are of the same person 99.38% of the time."

It sounds more like verification (is A equal to B?) rather than recognition (who is A?). Is the 99.38% accuracy for verification or recognition?

Davis King said...

It's 99.38% according to the LFW evaluation protocol. Complain to the LFW people about the choice of words if you don't like it.

钟华平 said...

I used the code in python_examples/face_recognition.py to get descriptors for two given face images and then calculate the cosine similarity between these two 128D descriptors so as to verify whether these two face images are from the same person. However, I found that although the input images are not from the same person, the similarity will be very high (greater than 0.9). Actually, I used the images from LFW to verify the code.

钟华平 said...
This comment has been removed by the author.
Davis King said...

As the example says, use Euclidean distance, not cosine similarity.
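In other words, something like this (a minimal sketch; desc1 and desc2 are assumed to be descriptors already computed by compute_face_descriptor):

import numpy as np

def same_person(desc1, desc2, threshold=0.6):
    # desc1/desc2 are 128D descriptors from compute_face_descriptor.
    # Threshold the Euclidean distance; do not use cosine similarity.
    return np.linalg.norm(np.array(desc1) - np.array(desc2)) < threshold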

florisdesmedt said...

Another great extension of the dlib library! Is there a reason the CPU HOG-based frontal face detector is used instead of the (more accurate) dnn version (other than training a model for only frontal faces)?

Best regards

Davis King said...

Thanks. No reason other than the HOG detector is faster.

ngap wei Tham said...

>The network was trained from scratch on a dataset of about 3 million faces

Thanks for the model and nice example.
Is it possible to make the dataset publicly available?

Davis King said...

I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

gaurav gupta said...

How does it compare to betaface?
https://www.betaface.com/wpa/

Davis King said...

I have no idea, do they post their accuracy on the LFW benchmark? I posted my LFW accuracy, so you can use that to compare against other tools.

Davis King said...

Turns out betaface has their accuracy listed on the LFW results page (http://vis-www.cs.umass.edu/lfw/results.html). It's only 98.08% apparently.

gaurav gupta said...

I tried using dlib face detection on a slightly blurred image and couldn't find any results. But betaface detected the face in the same image. Is there any preprocessing required?

Davis King said...

Maybe the face is too small and you need to make the image bigger, I don't know.

Davis King said...

You could also always try this detector (http://blog.dlib.net/2016/10/easily-create-high-quality-object.html) instead of the one used in the face recognition example program.

richardliao said...

I have tried to use dlib to detect anime faces but it only works less than 50% of the time. Is there any way I can tweak the code to do so without going through manual labeling and retraining models? Thanks!

Davis King said...

I doubt it. I would train a detector. It's pretty easy to do.

Kasper van Zon said...

I would like to play around with this face recognition network in combination with the OpenCV VideoCapture. The images from OpenCV (dlib::cv_image) are however in BGR pixel format and I am assuming that the face network is trained with RGB images. Would it make a big difference if I feed the network BGR images? Or does dlib have an efficient routine to convert from BGR to RGB?

Davis King said...

The images need to be RGB. If you are using C++ pretty much any way to convert the image is fine. I don't know what's a sensible method in Python.
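For the Python side of the same question, one common route is to let OpenCV do the swap (a sketch assuming frames come from cv2.VideoCapture):

import cv2

cap = cv2.VideoCapture(0)            # any OpenCV capture source
ok, frame_bgr = cap.read()           # OpenCV frames are BGR
if ok:
    # The face recognition network expects RGB, so swap channels first.
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    # frame_rgb can now go through the detector / shape predictor / facerec pipeline.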

Kasper van Zon said...

Thank you for the information, and for making your awesome library publicly available!

Converting the images in C++ is indeed relatively easy. I was just checking if there wasn't already something like a SIMD-optimized pixel conversion routine in dlib.

Davis King said...

No problem :)

You can also make a new input layer that reads directly from an OpenCV image if you feel the need. It's easy to do since the input layer interface you have to implement is fully documented: http://dlib.net/dlib/dnn/input_abstract.h.html#EXAMPLE_INPUT_LAYER

Daniel Sáez said...

Do you have any reference for the structured metric loss that you used? Thanks!

Davis King said...

The loss is described in the loss_metric_ documentation. However, I don't have a reference paper for it.

Anirud Thyagharajan said...

Adding to what Mohamed Ikbel asked, wouldn't it be required to train the network again for face verification on identities that were not present in the dataset?

This is a brilliant piece of code, giving the power to change metric functions as well, Kudos for that.

I'm also interested as to how to approach fine tuning the pretrained net, are there any APIs present for that? Thanks!

Yatong Zhang said...

Can you share the images you trained the model?

Davis King said...

No, it's not required to retrain. The model posted wasn't trained on any of the faces/identities in LFW for example. The whole point of this type of model is that you don't need to do that kind of target specific training, which is why metric learning style algorithms are so popular for face recognition and verification right now. That's not to say that you don't, as a post processing step, combine some kind of target specific SVM or something that operates on top of the metric learning algorithm. People sometimes do that and it can improve verification. But you can also just do k-nearest-neighbors as your verification algorithm and that is pretty good too. Many things are possible. But in any case, no, you don't retrain the metric learning part.

Although, if you want to retrain or fine tune or do anything like that the API is fully documented. There are introduction examples to the DNN API as well as a full API reference. http://dlib.net/faq.html#Whereisthedocumentationforobjectfunction

As for training data, as I said before: I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

Anirud Thyagharajan said...

Ah, I see. Thank you so much for your comprehensive reply. I will try it out for other image sets.

I tried it for the example file given in faces/2007_007763.jpg in the examples folder of the dlib Github repository, but the clustering didn't quite turn out correct. Is there any kind of preprocessing required for this to work out? Also, is there any necessity for more images of the same identity to be present for the clustering to work?

Davis King said...

Nothing is perfect. The examples are what they are. What is best for any application depends on the details and computer vision and machine learning is complex. I can always find some additional thing to do or change to some standard technique that makes something more or less applicable to any given problem.

Anirud Thyagharajan said...

Very true, this could be a specific outlier.

Thank you so much for your time and effort in replying to me, very much appreciated, and a great tool it is indeed!

Davis King said...

No problem :)

Christian Otto said...

I'm wondering if I did something wrong when compiling the dnn_face_recognition_ex.cpp since it appears to be very slow (it runs about 7 mins). Does it make use of the GPU? Do I have to enable something for it to do so?

Davis King said...

It will use CUDA and cuDNN if you have it installed. Also, are you using visual studio? http://dlib.net/faq.html#Whyisdlibslow

Christian Otto said...

No, I'm on Ubuntu and just build with CMake. Does the cuDNN version matter? And does it use CUDA if it is not installed in the standard location? Thanks!

Davis King said...

The CMake output tells you what is happening. There are big obvious messages that say things about CUDA and cuDNN, telling you what it's doing.

Christian Otto said...

Oh, I never actually used the provided CMakeLists file. It told me that I was using cuDNN version 4, which was wrong. It's about 10 times faster now, thank you Davis.

ARBaboon said...

Interestingly, it seems dlib_face_recognition_resnet_model_v1 has poor dynamic range for 25 to 40 year old African-Americans, tested with a dataset containing 200 people.

Davis King said...

Yeah, there is definitely some dataset bias. The training data I have, along with LFW, is definitely biased towards white guys in the sense that they are overrepresented in the data. I spent a while trying to gather non-white people for the training dataset to improve it but it's still somewhat biased.

Leugim said...

Interesting, I was going to try to improve OpenFace with a data set I recently crawled over the web.

Can I ask why you don't augment your data via random colour channel shifts? Unless I'm mistaken, I can't see you doing that.

Also, why have you decided to provide a Python interface for this but not for your DNN face detector?

mphielipp said...

Awesome new functionality!! Thank you Davis!

Any suggestions if I want to create a dll to use this in C#?

Davis King said...

The training data was augmented with random color shifts.

I made a Python interface because people asked for it. I didn't make it for other things because less people asked/I'm busy/don't feel like it.


In general my advice for calling C++ from C# (or java) is to use SWIG, which I've found to be very convenient.

bubi said...
This comment has been removed by the author.
bubi said...

Davis, thank you very much for this great work. Just a simple but intriguing question: have you used people of a different gender for hard-negative mining at the mini-batch level? Meaning a = female, p = female, n = male, or vice versa?

Davis King said...

Each mini-batch includes a mix of genders. So yes.

Luke said...

This seems so neat! Could a Python example and API be coming in the near future?

Davis King said...

There is a python example, it's discussed in this blog post.

Nithish Chauhan said...

Sir, you are awesome, and the dlib library too. I really like your dlib library; it has helped me a lot.

I am working in an image & video analytics team as a researcher at a company. I have around 2 years of programming experience in C++. How do I start writing code like dlib's? I really like your C++ code and it does wonders. I sometimes find it difficult to write classes that are reusable in C++. I really need your guidance on where to start and how to improve my code on a daily basis. Thanks in advance.

Davis King said...

You should study by reading books: http://dlib.net/books.html. That is the best way to get started. Anyone who tells you otherwise is leading you astray.

Unknown said...

Hi thanks for the hard work! I was wondering whether you would be able to release the images that you have used so that I can train a model from scratch using Tensorflow?

Davis King said...

I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

keep wandering said...

Can you post the specs of the machine you used to train the model, also the time it took? Since I am thinking of re-training with a more diverse dataset. Thanks

Davis King said...

Training took about a day on an older titan x.

Vladislav Lesov said...
This comment has been removed by the author.
Davis King said...

That's literally what the example program mentioned in this blog post does. It clusters 128D faces with chinese whispers.

Vladislav Lesov said...
This comment has been removed by the author.
KEN LO said...

Thanks for this great work! I tried other pictures with non-bald faces and found that all of the non-bald faces get categorized as the same one person, but the bald faces are correctly categorized to the right people. Is this a training problem? Or should I use a special way to run non-bald faces?

Davis King said...

I don't know why you are having trouble, but there is nothing special about bald faces.

ARBaboon said...

I find tweaking the threshold does wonders. I understand the net was trained to 0.6 but I have better results at 0.45. This is only an observation.

Davis King said...

Yes, different uses and data sets will probably benefit from adjusting the threshold.

Gavin said...

Hi Davis,

what preprocessing should I do if I want to finetune or train with my own face data?

1. rotate roll angle
2. rescale to 150*150
3. any other things?

Davis King said...

Yes, only the processing shown in the example program is needed.
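For reference, the alignment in the example amounts to cropping a 150x150 chip around the landmarks with 25% padding. A rough Python sketch (this assumes a dlib build that exposes dlib.get_face_chip; otherwise the extract_image_chip route shown in the C++ example does the same thing):

import dlib
from skimage import io

detector = dlib.get_frontal_face_detector()
sp = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = io.imread("face.jpg")          # placeholder training image
det = detector(img, 1)[0]            # assume one face per training image
shape = sp(img, det)

# Rotate and scale the face into a 150x150 chip with 25% padding, which is
# the same preprocessing the C++ example applies before training.
chip = dlib.get_face_chip(img, shape, size=150, padding=0.25)
io.imsave("face_aligned.jpg", chip)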

bright 910570 said...

Another awesome piece of work from you, thanks a lot!

I'd like to use a DNN for detection instead of the FHOG detector used in this example, but it seems that the shape predictor cannot directly take the result the net provides as input. How do I convert the net's output into something that can be used in this example?

Davis King said...

The shape predictor just requires a rectangle as input. It doesn't matter where the rectangle comes from so long as it's on the object you want.

Алексей Волков said...

Hello, I have a strange problem compiling dnn_face_recognition_ex.cpp. It just freezes, the CPU usage is maxed, but nothing happens (waited for 30 minutes). I figured out that it was caused by the very long type names generated by templates. If I use the alevel4>>>>> type, it's ok, but alevel3>>>>>> causes the problem. The compiler is supposed to raise warning C4503 (unless disabled), but not freeze. Tried installing the latest VS 2017 Enterprise, didn't help.
What would you advise as a workaround for the problem?

Davis King said...

This only happens in visual studio since it has terrible C++11 support. You can make it work in visual studio 2015, but visual studio 2017 has even worse C++11 support than 2015 apparently (a lot of users who are trying VC2017 have been complaining to me).

ARBaboon said...

I had to switch to Clang. In CMake you can still have the generator set to VS2017 but set your toolset to v141_clang_c2. Since then I have actually started to use a direct install of LLVM and I use the LLVM-vs2014 toolset (even though I use VS2017). I have altered the dlib CMake files a bunch to tell them not to disable features on MSVC if you are using Clang, but I think you can still get what you want with the CMake files that come with dlib.

Stefanelus said...

Davis, is there any way to decrease the sliding window size? It's 80x80 now, I think.

Davis King said...

Make the input image bigger or train a new detector with a smaller window.

Jay Doyle said...

Hi Davis,

I am using your dnn_metric_learning_on_images_ex.cpp to train on images that are roughly twice as wide as they are tall. I am using the example code (dnn_face_recognition_ex.cpp) to evaluate the trained net. The random_cropper appears to transpose the rows/cols (line 219: get_rect(img)), returning incorrect crops. I swapped the rows/cols and now get better crops, but there still appears to be an error in cropper when handling non-square crops. I can share some images with you if you would like.

I also noticed that the cropper.set_randomly_flip() is set to true, which will feed mirrored faces back to the net. This seems incorrect, but you may have a good reason for doing it.

Thanks,

Jay

Davis King said...

You have to set up the cropping in a way that's appropriate for your problem. There probably isn't "One True Random Cropper" that everyone can always use. Although I'm sure there might be usability improvements to the dlib random cropper object used in that example. But at the end of the day it's up to you to decide how to build the mini-batches.

Đức Lê Huỳnh said...


I have two computers: one Mac and one Windows machine.
When I use the dlib DNN to embed a vector, the time it takes to embed one image on the Mac is 0.05s while on Windows it takes 0.3s. Why is there such a big difference?
(since the two machines are on par with each other, in terms of specs)
My Mac config: Core i5, CPU 2.6 GHz, RAM 8 GB,
and my Windows config: Core i7, CPU 2.4 GHz, RAM 32 GB

Thanks,

Duc

Davis King said...

http://dlib.net/faq.html#Whyisdlibslow

gaurav gupta said...

Why does the face recognition model take landmarks (shape) as input in the Python example? How are they used to compute the 128D face descriptor?
face_descriptor = facerec.compute_face_descriptor(img, shape)

Duc Le Huynh said...

I use Qt Creator, I don't use Visual Studio.
http://dlib.net/faq.html#Whyisdlibslow only covers Visual Studio.

Davis King said...

Maybe you have a BLAS library on one machine and not on other then. I don't know.

Duc Le Huynh said...
This comment has been removed by the author.
gaurav gupta said...
This comment has been removed by the author.
Steven said...

Hello Davis,



I have been asked to download the dlib library on my Windows system. I have followed the instructions given here: http://www.paulvangent.com/2016/08/05/emotion-recognition-using-facial-landmarks/

I have already downloaded CMake and Visual Studio. In a command prompt in the dlib folder, running "python setup.py install" gives the problem that you can see in the picture linked below.

http://imgur.com/QJZyh0c

I have Python 3.6 with conda running on Windows 7 64-bit. The Boost and CMake steps went ok, and I installed Visual Studio, but the last part (python setup.py install) does not work.

Thanks in advance

Davis King said...

You didn't install the C++ part of visual studio. You have to select C++ when you install it or it's just going to install other visual studio stuff.

gaurav gupta said...

How is dlib's face descriptor embedding different from FaceNet's embeddings?
https://arxiv.org/pdf/1503.03832.pdf
https://github.com/davidsandberg/facenet

sutony said...

Hi Davis King,
From your response to Kyle McDonald's comment, I learned more about how you cleaned the dataset. Thanks for sharing your experience.

However, I still don't understand how you used the graph clustering method. Is it used for either of the purposes below?
1) automatically merging the same identities from different datasets (VGG/FaceScrub)?
Or
2) clustering similar faces within an identity's folder so that you can more easily pick out the outliers manually?

Davis King said...

Both. It's still a very manual process. You have to do a lot of review to make sure the labeling is going to be improved. These automated tricks are just to help you review the data and find labeling mistakes. They aren't going to create a cleaned dataset for you.

Danish Nazir said...

Hi Davis, first of all, great work! I just wanted to ask if there is a Python implementation of the recognition model you described above. I found http://dlib.net/face_recognition.py.html, but it is very limited compared to your C++ example, so I was hoping you could provide an example in Python that draws the histogram and then applies the Chinese Whispers algorithm. It would be a great help if you could do such a thing.
Regards,

Gavin said...

Hi Davis,

I'm curious why you use the Euclidean metric as the loss metric. I see a lot of other frameworks using cosine similarity to test the similarity between faces. Is it by design?

Davis King said...

I found that this works better.

AlexAnd said...

Hello Davis,
Could you explain, please, the threshold value (0.6) - is it there by design? Can it be set lower/higher, and by how much? In 128D even the slightest increase/decrease of it should mean a HUGE difference in volume, am I right?

Davis King said...

You can set it to whatever you want. See what happens. Maybe it works for your problem or maybe it doesn't.

AlexAnd said...

Thank you for your answer. Just one more question - should only RGB images be used as input for this particular pretrained DNN? Could I use grayscale ones instead? Dlib's face detection eats them perfectly, but here I see no options. Though I'm quite a newbie in face recognition, it seems obvious to me that such an unreliable thing as color information (different light conditions, changed skin tint/make-up, etc.) should not offer much real help to the process, am I right? Anyway - is there any reason for me to try to define something like input_grayscale_image, so the data could be transferred into tensors in a proper way?

Davis King said...

You don't have to do anything. The existing code will load a jpeg or png or whatever and process it just the same regardless of it being color or gray. As for how well it will work without color, I have no idea. It's probably alright, but maybe not quite as good.
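If you are generating grayscale arrays in memory rather than loading files, one simple option is to replicate the channel before handing the image to the network (a numpy sketch; the gray array stands in for whatever your application produces):

import numpy as np

# gray stands in for the HxW uint8 image your application produces.
gray = np.zeros((480, 640), dtype=np.uint8)

# Replicate the single channel into R, G and B so the RGB-expecting
# network can consume it; the color information is simply absent.
rgb = np.dstack([gray, gray, gray])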

AlexAnd said...

Thank you once again. Actually, I'm just trying to deal with this DNN directly from my application, where I produce only grayscale images. Last question: what are the R,G,B average values (122.782, 117.001, 104.298) there for? I cloned your input_rgb_image_sized into a new input type (all the same, only luminance goes to R,G,B), but have no idea about these offsets. So, in order to achieve the best results, should they be put there in the same asymmetric way, or just be in [-0.5, +0.5] bounds, simply copied for all color channels?

Dreamer said...

Hi, in dlib I find many examples that can differentiate faces; can it differentiate other objects like bottles or bags based on their color? Which algorithm do you think can help? Any link or suggestion will be very helpful. Thanks in advance.

Davis King said...

If you get a big training dataset I'm sure you could make something that does that. There are links to the example programs that show how to train new models referenced from this post. In particular, read the C++ face recognition examples.

Antonio Jesús said...

Hi Davis,

I'm using openface: http://cmusatyalab.github.io/openface/demo-3-classifier/ for training a classifier. I have 1,500 people with around 50 images per person, which makes 80k images and generates a huge classifier that takes ~35 seconds to predict a person.

Now I want to scale that up, but if I increase the number of people to 10k it will take forever. My current machine is what I have; will your approach have better performance? I mean in time, not in accuracy. Or could you give me some advice about what to improve/change/use?

Thanks

Davis King said...

You aren't going to make a very good model with a dataset that small. Fortunately, you can use the free model that comes with dlib. It is trained on millions of faces and gets state-of-the-art accuracy on the standard LFW benchmark for face recognition.

Antonio Jesús said...

I think I didn't explain myself properly. I don't want to generate a new model. From what I got from your post, you have a clustered set of images of different actors and it can cluster them by actor; I guess you just put the names in the picture. But right now I have 80,000 images and I'm planning to expand that to 800,000 if I keep having 50 images per person. How can I use a new picture that isn't in this model and make a prediction of who has the most similarity with my new image, as openface is doing?

QA Collective said...

Just briefly, congratulations on the dlib library, it's fantastic and I'm only just getting to know it - it's helping both speed up my code and make it work with higher accuracy!

I have a question regarding face descriptors. I am tracking unique faces seen in video clips in a real-time application. So with a stream of frames containing faces, I already compute the Euclidean distance between faces to check whether I've seen a face before. If I do see a new face, I collate descriptors into an inventory for that face.

I'm wondering, for the inventory (cluster) of descriptors that I've already gathered for each 'unique face', would there be any benefit in computing an 'averaged descriptor' for each person - say over 50 frames?

My thinking is that this might help identify that face more accurately (or in 'fringe cases'), because it could account for a moving mouth (while speaking) and blinking eyes and angles of the face nodding a head etc

I only used the phrase 'averaged descriptor' in an abstract sense as I have no idea mathematically how best I would do this if it were deemed a good idea - would I literally do an index-wise average over 50 vectors to produce a single vector?

Andrew

Davis King said...

Thanks, I'm glad you like dlib :)

Trying to make an average isn't usually going to work very well, due to the unintuitive geometry of high dimensional spaces. To be specific, suppose you have two sets of points in 128 dimensions, call them A and B, such that all the points in A are within 0.6 distance of each other, and similarly all the points in B are within 0.6 distance of each other. Moreover, suppose that none of the points in A are within 0.6 distance of any point in B.

It is surprising, but true, that it's quite likely that the distance between the centroid (i.e. the average of all the points) of A and the centroid of B is less than 0.6 apart. This kind of thing can happen in low dimensions as well, but it becomes increasingly more likely to happen when the dimension goes up.

So using an average is generally not a good idea. You should instead use a k-nearest-neighbor type of algorithm to do classification.
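A rough sketch of that kind of k-nearest-neighbor check over stored descriptors (plain numpy; the gallery array and labels list are placeholders you would build from your own data):

import numpy as np
from collections import Counter

def knn_identify(query, gallery, labels, k=5, threshold=0.6):
    # query: 128D descriptor of the face to identify.
    # gallery: NxD array of stored descriptors; labels: name for each row.
    # Vote over the k nearest neighbors that fall inside the distance threshold.
    dists = np.linalg.norm(gallery - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest if dists[i] < threshold]
    if not votes:
        return "unknown"
    return Counter(votes).most_common(1)[0][0]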

QA Collective said...

Did you mean to say 'greater than 0.6' apart in your sentence above? That would seem to make more sense in the context of what you were saying.

In this case, I will begin to teach myself about k-nearest-neighbor algorithms, as I've been hearing a lot about them in my reading of late :)

Davis King said...

No. I meant less than. It is quite counterintuitive :)

QA Collective said...

Haha, right you are then on the counterintuitiveness! Although I do now understand what you meant - I hadn't read the suppositions in your previous paragraph carefully enough. My apologies.

Davis King said...

No worries :)

Antonio Jesús said...

Sorry to insist, Davis, but do you have any hint?

John Mac said...


Very simple question on the use of dlib in this example:
I am interested in comparing an unknown face one by one with a large number of known faces.
I can see in the example code that:
face_descriptor = facerec.compute_face_descriptor(img, shape)
will give me a 128D vector for my unknown face.

If I have a database of all the 128D vectors for all the known faces, how can I compare
two 128D vectors to get the distance between them (i.e. the similarity of the faces)?

Davis King said...

Use a for loop? There are a lot of ways to code it.
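For example, something along these lines (a sketch; known_descriptors is a hypothetical dict of name -> 128D descriptor built ahead of time):

import numpy as np

def best_match(unknown, known_descriptors, threshold=0.6):
    # known_descriptors: dict mapping a name to its stored 128D descriptor.
    # Returns (name, distance) of the closest face, or (None, distance)
    # when nothing in the database is under the threshold.
    best_name, best_dist = None, float("inf")
    for name, known in known_descriptors.items():
        dist = np.linalg.norm(np.array(unknown) - np.array(known))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return (best_name, best_dist) if best_dist < threshold else (None, best_dist)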

lbouza said...

Hi Davis, very nice work with dlib! I'm a PhD student working in face recognition and I have used dlib a lot for face detection, landmark localization, tracking, etc. with remarkable results. Now, I'm trying to replicate your results following the LFW protocol. Doing so, a first question arises: which images did you use? As you know, there are different sets of the LFW database according to the alignment method used, i.e., the original aligned, funneled, deep funneled or lfw-a images. Or did you perform a different alignment/preprocessing on the images?

Davis King said...

I used the regular old unaligned images in LFW and ran them through the alignment procedure you see in the dlib example programs.

Jon Hauris said...

Fantastic stuff. Thanks for all you've done!
I am doing face detection / recognition on IR images. This means I cannot use the standard features for detection or recognition. I am trying to build my own detector using your "train_object_detector.py" and it is working really well - mostly.
I have a training set of faces of one size, and the detector finds faces of similar sizes but completely misses smaller face images.

So my question is: how does the system detect faces of different sizes? Do you need to have training samples of all the sizes that you expect to be finding? Or does the system take in the training images and resize them?

If you could clarify how this process works and what kind of training set I need and how it works to find faces of different sizes, I would really appreciate it. I have the recognizer working well, I just need to find the faces.

Thank you, Jon Hauris

Jon Hauris said...

I guess what I am asking is: how do the stride and pyramid layering work, and how do we control them? I am using the Python interface.
Thanks
Thanks

Davis King said...

You can't control it through python. But more importantly, you should read about image pyramids if you want to understand what it's doing. Wikipedia explains them well. And more to your question, the detector finds all objects in an image bigger than the "detection window" which is user specified. If you want to find smaller objects either train with a smaller window or resize your images so they are bigger. Usually resizing is the best strategy.
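As a concrete sketch of the resizing idea in Python (shown with the frontal face detector; with a detector trained via train_object_detector.py you can equivalently enlarge the image yourself before calling it; the file name is a placeholder):

import dlib
from skimage import io

detector = dlib.get_frontal_face_detector()
img = io.imread("frame.png")     # placeholder image

# Each upsample step doubles the image size, so objects smaller than the
# detection window become large enough to be found (at the cost of speed).
dets = detector(img, 2)          # upsample twice, then scan the pyramid
print("found", len(dets), "faces")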

Jon Hauris said...

Thank you, that is exactly what I needed to know.

Jon Hauris said...

I would really appreciate it if you could also tell me what this parameter does and how to interpret/adjust it: options.upsample_limit = 2;
What happens if I increase/decrease this value?
Thank you, Jon

Davis King said...

It's described in the documentation: http://dlib.net/python/index.html

Stefanelus said...

hey Davis,

The face recognition tool uses the DNN distance metric. I was trying to use it on a GPU to train with some data.
The GPU utilisation is under 7%, and after a while it drops to 0, and the whole process is very slow. Is there anything I can do to make the whole thing faster?

Stefan

Davis King said...

Run multiple images at a time by putting them into a batch.

Shubham Juneja said...

Hi. Great stuff, thanks for this. I have a question: I have been using the code in face_recognition.py to run the LFW protocol myself. As it says, I should get 99.13% without the third parameter, while I only get 99.02% somehow. Could it be that the file face_recognition.py is missing something that you used while testing?

Also, the default dlib face detector sometimes misses some face detections while testing over LFW.

Davis King said...

LFW isn't about testing detection, just recognition. So you have to measure the accuracy of only the recognition component. This means you don't throw away any images in the LFW set when you do the evaluation.

Sliver21 said...

Completely true, I agree with you. Hence I went for a state-of-the-art detector. Yet the accuracy with the specified jitter/resampling params comes out 0.11% lower than each of the 3 mentioned figures.

Davis King said...

Use dlib's detector since that's what the whole pipeline is trained with, but for faces it doesn't find, supply boxes that are similar to what the dlib detector would have provided.

Shubham Juneja said...

Thank you for your reply :)

Jon Hauris said...

I have trained my own IR face detector using train_object_detector.py
I have 2 sizes of faces:
a.) 362x292 face and
b.) 108x82 face inside a 362x292 image
I set options.upsample_limit = 4; and options.detection_window_size = 80*80;
And I trained only on the a.) faces
When I run the detector it finds the a=362x292 sized faces but not the b=108x82.
-----
When I manually resize the b. by 2x, i.e. 724x584, it finds the resized 108x82 faces.
-----
Shouldn't the upsample_limit be doing this 2x resizing and finding the b. images?
Otherwise it would be upsampling the 724x584 by 2, 3, &, 4 and this would overlap the
2, 3, 4, upsampling of the 362x292 images
4x(362x292) = 2x(724x584)
So, if the upsampling was working it should get the b. images.
Thanks, Jon

Davis King said...

upsample_limit applies only during training. When you use the detector it's up to you to prepare your image by upsampling, downsampling, cropping, or whatever else you think is appropriate before you run the detector.

Jon Hauris said...

OK, THANKS

Eslam Snono said...
This comment has been removed by the author.
Eslam Snono said...

Hi,
I have my own database of 3 people and want to use them as a reference to check if they exist in a picture. I tried to train a model using dnn_metric_learning_on_images_ex, but each time it gives me an error in dlib/dnn/loss.h. So can you help me use my database for face recognition here?
Thanks in advance

Davis King said...

You don't need to train anything. You should use the trained model file mentioned in this blog post.

Bill Klein said...

Great work here. I notice that detection/normalization/description works very quickly (~50ms); however, jitter_image() is a major bottleneck to say the least (> 2s in my tests).

I'm wondering if anyone can suggest a less-than-perfect, but better-than-nothing jitter technique that won't add more than a few ms of time...?

bkj said...

Is there a script available anywhere for reproducing the LFW results?

Specifically, I see above you run the face alignment algorithm, but do you also run the face detector? Or do you just run the aligner on the raw image?

Davis King said...

Edit jitter_image so it does less rounds of jittering. There is a for loop inside it, just change it to run fewer times.


As for the LFW test script, the entire program that runs the LFW test can be found here: http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2

Bill Klein said...

Thanks for the great support! One other question: In the example, is the use of net (anet_type) thread safe?

I know the general rule for dlib, but after looking through the code/docs I couldn't figure out whether the use of net is truly non-modifying, and therefore thread safe without sync...

Davis King said...

If you look at the documentation you will see that running anything through a net modifies the net in a variety of ways, so no, it's not thread safe.

Bill Klein said...

Thanks again for the help. However, what documentation are you referring to? I've read the docs for loss_metric and many others, but haven't seen explicit mention of that... Perhaps that's because a number of classes are involved and I'm not looking at the right one. I will continue reading up.

Also, am I right in understanding that this all happens in a single thread? Looking at the docs/headers I'm trying to figure out if it's already parallelized, but can't figure it out.

BTW, even aside from your great support, I am surprised at how easy it was to get everything working and how well it works. Amazing stuff.

Davis King said...

Practically all of the documentation. For instance, didn't you notice you can call get_output() on different layers on the net to see the output of that layer? How could that work if the output wasn't stored in the net?

KEN LO said...

Hi Davis,

Is it patent-free when i use this deep learning face recognition algorithm or other dlib algorithms for commercial usage ?

Davis King said...

I don't know of any patents that cover algorithms in dlib.

Ola Glasmann said...

I have been toying around with converting the model to Tensorflow, which I'm more familiar with than C++. Is there any preprocessing on the input images before they are fed to the network? It looks like the class input_rgb_image_sized does some subtraction of RGB-mean values and divides by 256, is that also performed on the input faces for this network?

Also, I see the face landmarks are passed to the facerec model. Is this to do some fancy face alignment before feeding to the network?

Great work on this, and dlib in general!

Davis King said...

The python code is somewhat opaque. However, if you look at the C++ example it's all laid out for you to see: http://dlib.net/dnn_face_recognition_ex.cpp.html

Ola Glasmann said...

Thanks for the swift reply! I'm not fluent in C++ so this might be a stupid question, but I see input_rgb_image_sized<150> is the first "layer" of the network. Does this mean it automatically does the mean subtraction on input images (based on the to_tensor function in class input_rgb_image_sized from dlib/dnn/input.h).

I'm asking because I get different results on the same image in my Tensorflow version of the network and the dlib implementation, and I'm wondering whether it's some bug of my own making or simply differences in preprocessing.

Davis King said...

Yes, it subtracts and scales the input pixels.
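For the port, that preprocessing can be reproduced in a couple of lines (the per-channel means are the ones quoted above; this mirrors input_rgb_image_sized's to_tensor: subtract the mean, then divide by 256):

import numpy as np

# chip stands in for a 150x150x3 RGB face crop, uint8.
chip = np.zeros((150, 150, 3), dtype=np.uint8)

# Subtract the per-channel means quoted above, then scale by 1/256,
# matching what input_rgb_image_sized does before the first conv layer.
rgb_means = np.array([122.782, 117.001, 104.298])
tensor = (chip.astype(np.float32) - rgb_means) / 256.0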

Giorgos B. said...

Hello Davis,
I admire your work very much, and wish to congratulate you for that! I want to build a gender recognition algorithm with age classes (baby, child, teen, young adult, adult, senior) with the help of dlib. What are your thoughts about that? How can I train a classifier (like an SVM) to do that? I've already found about 3000 images of people and spanned them across the above classes..
Thanks a lot,
George!

Davis King said...

Thanks, I'm glad you like dlib. There are lots of examples in the examples folder that show how to do classification. Pick one and start from there.

pravallika kollipara said...

hey davis,

I tried using dlib for face recognition. I tried it with both C++ and Python, but the problem is that I am getting a different vector as the face descriptor in each... do you know any reason why this is happening?

Davis King said...

Given the same inputs you should get the same outputs.

pravallika kollipara said...

Ya, my bad, it does give the same output, but the problem is that in C++ it gives floats and in Python it gives long doubles, which leads to approximation in C++. I tried to define the matrix as long double in C++ but it gives an error that net(faces) is of a float matrix type and can't be converted to a long double matrix.
Can you suggest what should be done? I am a novice in this, so don't mind all the beginner questions please... :)


thanks!!

Davis King said...

Don't worry about it, float vs double doesn't matter here.