Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy-to-use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:




Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.

For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of datasets: the FaceScrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and FaceScrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib.  There is also a Python API for accessing the face recognition model.
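As a rough illustration of the kind of loss described above, here is a minimal Python sketch of a pair-wise hinge loss with mini-batch hard-negative mining. It is not dlib's actual loss_metric_ implementation: the function names, the margin value of 0.04, and the averaging at the end are assumptions made for this sketch. Only the 0.6 distance threshold comes from the post.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_hinge_loss(embeddings, labels, threshold=0.6, margin=0.04):
    """Toy pair-wise hinge loss over every pair in a mini-batch.

    Same-identity pairs are penalized when they are farther apart than
    threshold - margin; different-identity pairs are penalized when they
    are closer than threshold + margin. The margin is an assumed value.
    """
    pos_losses, neg_losses = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = euclidean(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            pos_losses.append(max(0.0, d - (threshold - margin)))
        else:
            neg_losses.append(max(0.0, (threshold + margin) - d))

    # Hard-negative mining at the mini-batch level: keep only the hardest
    # negative pairs, and no more of them than there are positive pairs.
    neg_losses.sort(reverse=True)
    hard_negs = neg_losses[:len(pos_losses)]

    count = len(pos_losses) + len(hard_negs)
    return (sum(pos_losses) + sum(hard_negs)) / count if count else 0.0
```

A batch whose identities already sit in well-separated clusters gives a loss of zero, while a batch with a different-identity pair closer than the threshold gives a positive loss.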



[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A data-driven approach to cleaning large face datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.

232 comments :

Mohamed Ikbel Boulabiar said...

Can it detect if someone is not in the base?
Detecting unknown people is a problem in another library with no way to say if a face is not in the labelled faces base.

Davis King said...

Yes. At the end of the day, this is a classifier that tells you if two images are of the same person. Half its job is to say "no" when they aren't.
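To make that concrete, here is a minimal Python sketch of how an "unknown person" check could be built on top of pairwise comparisons. The function names and the toy 2D vectors are hypothetical; real descriptors are 128D, and the 0.6 threshold is the one mentioned in the post.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(desc_a, desc_b, threshold=0.6):
    """Say "yes" only when two descriptors are closer than the threshold."""
    return euclidean(desc_a, desc_b) < threshold

def is_known(descriptor, known_descriptors, threshold=0.6):
    """A face counts as known if it matches at least one stored descriptor."""
    return any(same_person(descriptor, k, threshold) for k in known_descriptors)

# Toy 2D stand-ins for real 128D descriptors.
known = [(0.0, 0.0), (1.0, 1.0)]
print(is_known((0.1, 0.0), known))  # close to the first stored face
print(is_known((5.0, 5.0), known))  # far from everything stored
```

If no stored descriptor is within the threshold, the face is treated as someone who is not in the base.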

Kyle McDonald said...

Could you say a little more about what "graph clustering methods" you used here? I'm interested in using this on a dataset to cluster unknown identities. Right now I have a few ideas: 1.) just do k-means, 2.) do the n^2 comparisons, then do k-means on those rows, 3.) take each face and compare it to the n-1 others, assign it to the best match, and then at the end group all the faces that are part of the same set (don't know if there's a name for #2 or #3...)

Davis King said...

The one you probably want to use is the one in the example program, the "Chinese Whispers" algorithm. The paper describing the method is referenced in the dlib documentation. It's a really simple iterative graph neighbor relabeling algorithm that gives surprisingly good results. It's what made the 4 clusters in this example. You don't even tell it how many clusters there are.
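For readers who want to see the idea in code, here is a minimal Python sketch of Chinese Whispers style label propagation. It is not dlib's implementation. It assumes an unweighted graph with an edge between any two descriptors closer than 0.6, and it breaks ties by picking the smallest label so the result is reproducible; the usual formulation breaks ties randomly. All names are hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chinese_whispers(descriptors, threshold=0.6, iterations=100, seed=0):
    """Return one cluster label per descriptor."""
    n = len(descriptors)

    # Build the graph: connect descriptors that are closer than the threshold.
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if euclidean(descriptors[i], descriptors[j]) < threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)

    labels = list(range(n))  # every node starts in its own cluster
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(iterations):
        rng.shuffle(order)  # visit nodes in random order each pass
        for i in order:
            if not neighbors[i]:
                continue  # isolated nodes keep their own label
            votes = {}
            for j in neighbors[i]:
                votes[labels[j]] = votes.get(labels[j], 0) + 1
            best = max(votes.values())
            # Adopt the most common neighbor label (smallest label on ties).
            labels[i] = min(label for label, v in votes.items() if v == best)
    return labels

# Toy 2D stand-ins for 128D descriptors: two tight groups far apart.
faces = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(len(set(chinese_whispers(faces))))  # number of clusters found
```

Note that the number of clusters is never passed in. It falls out of the graph structure.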

There are also graph clustering methods like modularity clustering, which is also in dlib, but I've found on many problems that a simple method like Chinese whispers gives better results. Which is surprising considering how theoretically well motivated modularity clustering is.

As for what else I did to clean up the data: I would sort pairs of identities by their average similarity. That helped find cases where the same person appeared under two names. Then I would also sort all the images for a given person by how close they were to the centroid of their class. If you then look at that sorted list you can see obvious labeling errors accumulate at the end and remove them. There were a bunch of other minor variations on that kind of theme with a bunch of manual review. A LOT of manual review.
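The centroid-sorting step can be sketched in a few lines of Python. This is only an illustration with hypothetical names and toy 2D vectors in place of real 128D descriptors; it produces the review order and leaves the decision about what to remove to a human.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sort_by_centroid_distance(descriptors):
    """Return image indices ordered from closest to farthest from the class centroid."""
    c = centroid(descriptors)
    return sorted(range(len(descriptors)), key=lambda i: euclidean(descriptors[i], c))

# Three consistent images of one identity plus one likely labeling error.
one_identity = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
print(sort_by_centroid_distance(one_identity))  # suspicious images end up last
```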

Kyle McDonald said...

Thanks! I just looked into the Chinese whispers algorithm. It feels like a graphical version of the k-medoids algorithm, except you're changing the assignments of each item instead of changing the medoid assignment. It makes sense to me that it would converge on something useful if the initialization is good, but I would expect it to have similar problems as k-means where bad initialization can cause degenerate assignments. I'll run it a few times and look for the best results :)

Davis King said...

You will be surprised. It's very good considering it's a really simple method. I'm still slightly mystified that it's better than modularity clustering but that's always been my experience.

I've also found that the random initialization is irrelevant. It always seems to converge to something pretty sensible. The only thing I can say that's bad, aside from the name being maybe slightly racist, is that sometimes I've found it useful to do some kind of post processing to clean up the results. e.g. looking at clusters and checking if any of them have a lot of edges between them and merging them after the fact. But usually it's pretty good.

ngap wei Tham said...

The comments of cpp example mentioned

"This model has a 99.38% accuracy on the standard LFW face recognition benchmark, which is comparable to other state-of-the-art methods for face recognition as of February 2017."

But this post said

"given two face images, it correctly predicts if the images are of the same person 99.38% of the time."

It sounds more like verification (is A equal to B?) rather than recognition (who is A?). Is the 99.38% accuracy for verification or recognition?

Davis King said...

It's 99.38% according to the LFW evaluation protocol. Complain to the LFW people about the choice of words if you don't like it.

钟华平 said...

I used the code in python_examples/face_recognition.py to get descriptors for two given face images and then calculate the cosine similarity between these two 128D descriptors so as to verify whether these two face images are from the same person. However, I found that although the input images are not from the same person, the similarity will be very high (greater than 0.9). Actually, I used the images from LFW to verify the code.

Davis King said...

As the example says, use Euclidean distance, not cosine similarity.

florisdesmedt said...

Another great extension of the dlib library! Is there a reason the CPU HOG-based frontal face detector is used instead of the (more accurate) dnn version (except training a model for only frontal faces)?

Best regards

Davis King said...

Thanks. No reason other than the HOG detector is faster.

ngap wei Tham said...

>The network was trained from scratch on a dataset of about 3 million faces

Thanks for the model and nice example.
Is it possible to make the dataset publicly available?

Davis King said...

I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

gaurav gupta said...

How is it compared to betaface ?
https://www.betaface.com/wpa/

Davis King said...

I have no idea, do they post their accuracy on the LFW benchmark? I posted my LFW accuracy, so you can use that to compare against other tools.

Davis King said...

Turns out betaface has their accuracy listed on the LFW results page (http://vis-www.cs.umass.edu/lfw/results.html). It's only 98.08% apparently.

gaurav gupta said...

I tried using dlib face detection in a bit blurred image. Couldn't find any results. But betaface detected the face in the same image. Is there any preprocessing required?

Davis King said...

Maybe the face is too small and you need to make the image bigger, I don't know.

Davis King said...

You could also always try this detector (http://blog.dlib.net/2016/10/easily-create-high-quality-object.html) instead of the one used in the face recognition example program.

richardliao said...

I have tried to use dlib to detect anime faces but it only works less than 50% of the time. Is there any way I can tweak the code to do so without going through manual labeling and retraining models? Thanks!

Davis King said...

I doubt it. I would train a detector. It's pretty easy to do.

Kasper van Zon said...

I would like to play around with this Face Recognition network in combination with the OpenCV VideoCapture. The images from OpenCV (dlib::cv_image) are however in bgr pixel format and I am assuming that the face network is trained with rgb images. Would it make a big difference if I feed the network bgr images? Or does dlib have an efficient routine to convert from bgr to rgb?

Davis King said...

The images need to be RGB. If you are using C++ pretty much any way to convert the image is fine. I don't know what's a sensible method in Python.
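For Python users, one simple way to do the conversion is to reverse the channel order of every pixel. The sketch below uses plain nested lists so it runs without any dependencies; the function name is hypothetical. If the image is a NumPy array of shape (height, width, 3), as OpenCV's Python bindings return, the slice img[:, :, ::-1] performs the same channel reversal.

```python
def bgr_to_rgb(image):
    """Swap the first and third channel of every pixel.

    `image` is a list of rows, each row a list of (b, g, r) tuples.
    """
    return [[(r, g, b) for (b, g, r) in row] for row in image]

# A 1x2 toy image: one pure-blue and one pure-red pixel in BGR order.
bgr = [[(255, 0, 0), (0, 0, 255)]]
print(bgr_to_rgb(bgr))
```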

Kasper van Zon said...

Thank you for the information, and for making your awesome library publicly available!

Converting the images in C++ is indeed relatively easy. I was just checking if there wasn't already something like a SIMD optimized pixel conversion routine in dlib.

Davis King said...

No problem :)

You can also make a new input layer that reads directly from an OpenCV image if you feel the need. It's easy to do since the input layer interface you have to implement is fully documented: http://dlib.net/dlib/dnn/input_abstract.h.html#EXAMPLE_INPUT_LAYER

Daniel Sáez said...

Do you have any reference for the structured metric loss that you used? Thanks!

Davis King said...

The loss is described in the loss_metric_ documentation. However, I don't have a reference paper for it.

Anirud Thyagharajan said...

Adding to what Mohamed Ikbel asked, would it not be required to train the network again for the task of face verification of some faces of some identities that were not present in the dataset?

This is a brilliant piece of code, giving the power to change metric functions as well, Kudos for that.

I'm also interested as to how to approach fine tuning the pretrained net, are there any APIs present for that? Thanks!

Yatong Zhang said...

Can you share the images you trained the model on?

Davis King said...

No, it's not required to retrain. The model posted wasn't trained on any of the faces/identities in LFW for example. The whole point of this type of model is that you don't need to do that kind of target specific training, which is why metric learning style algorithms are so popular for face recognition and verification right now. That's not to say that you don't, as a post processing step, combine some kind of target specific SVM or something that operates on top of the metric learning algorithm. People sometimes do that and it can improve verification. But you can also just do k-nearest-neighbors as your verification algorithm and that is pretty good too. Many things are possible. But in any case, no, you don't retrain the metric learning part.

Although, if you want to retrain or fine tune or do anything like that the API is fully documented. There are introduction examples to the DNN API as well as a full API reference. http://dlib.net/faq.html#Whereisthedocumentationforobjectfunction

As for training data, as I said before: I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

Anirud Thyagharajan said...

Ah, I see. Thank you so much for your comprehensive reply. I will try it out for other image sets.

I tried it for the example file given in faces/2007_007763.jpg in the examples folder of the dlib Github repository, but the clustering didn't quite turn out correct. Is there any kind of preprocessing required for this to work out? Also, is there any necessity for more images of the same identity to be present for the clustering to work?

Davis King said...

Nothing is perfect. The examples are what they are. What is best for any application depends on the details and computer vision and machine learning is complex. I can always find some additional thing to do or change to some standard technique that makes something more or less applicable to any given problem.

Anirud Thyagharajan said...

Very true, this could be a specific outlier.

Thank you so much for your time and effort in replying to me, very much appreciated, and a great tool it is indeed!

Davis King said...

No problem :)

Christian Otto said...

I'm wondering if I did something wrong when compiling the dnn_face_recognition_ex.cpp since it appears to be very slow (it runs about 7 mins). Does it make use of the GPU? Do I have to enable something for it to do so?

Davis King said...

It will use CUDA and cuDNN if you have it installed. Also, are you using visual studio? http://dlib.net/faq.html#Whyisdlibslow

Christian Otto said...

No I'm on ubuntu and just build with CMake. Does the cudnn version matter? And does it use cuda if it is not installed in the standard location? Thanks!

Davis King said...

The CMake output tells you what is happening. There are big obvious messages that say things about CUDA and cuDNN, telling you what it's doing.

Christian Otto said...

Oh i never actually used the provided CMakeLists file. It told me that I was using cudnn version 4, which was wrong. It's about 10 times faster now, thank you Davis.

ARBaboon said...

Interestingly it seems dlib_face_recognition_resnet_model_v1 has poor dynamic range for 25 to 40 year old african-americans, tested with a dataset containing 200 people.

Davis King said...

Yeah, there is definitely some dataset bias. The training data I have, along with LFW, is definitely biased towards white guys in the sense that they are overrepresented in the data. I spent a while trying to gather non-white people for the training dataset to improve it but it's still somewhat biased.

Leugim said...

Interesting, I was going to try to improve OpenFace with a data set I recently crawled over the web.

Can I ask why you don't augment your data via random colour channel shift? Unless I'm mistaken, but I can't see doing that.

Also, why have you decided to provide a Python interface for this but not for your DNN face detector?

mphielipp said...

Awesome new functionality!! Thank you Davis!

Any suggestions if I want to create a dll to use this in C#?

Davis King said...

The training data was augmented with random color shifts.

I made a Python interface because people asked for it. I didn't make it for other things because less people asked/I'm busy/don't feel like it.


In general my advice for calling C++ from C# (or java) is to use SWIG, which I've found to be very convenient.

bubi said...

Davis, thank you very much for this great work. Just a simple but intriguing question: Have you used a person with a different gender for hard-negative mining at the mini-batch level? Meaning a = female, p = female, n = male, or vice versa?

Davis King said...

Each mini-batch includes a mix of genders. So yes.

Luke said...

This seems so neat! Could a Python example and API be coming in the near future?

Davis King said...

There is a python example, it's discussed in this blog post.

Nithish Chauhan said...

Sir, you are awesome, and the dlib library too! I really like your dlib library; it has helped me a lot.

I work as a researcher on the image and video analytics team at a company, and I have around 2 years of programming experience in C++. Sir, how do I start writing code like dlib? I really like your C++ code and it does wonders. I sometimes find it difficult to write classes that are usable in C++. I really need your guidance on where to start and how to improve my code on a daily basis. Thanks in advance.

Davis King said...

You should study by reading books: http://dlib.net/books.html. That is the best way to get started. Anyone who tells you otherwise is leading you astray.

Unknown said...

Hi thanks for the hard work! I was wondering whether you would be able to release the images that you have used so that I can train a model from scratch using Tensorflow?

Davis King said...

I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

keep wandering said...

Can you post the specs of the machine you used to train the model, also the time it took? Since I am thinking of re-training with a more diverse dataset. Thanks

Davis King said...

Training took about a day on an older titan x.

Vladislav Lesov said...
This comment has been removed by the author.
Davis King said...

That's literally what the example program mentioned in this blog post does. It clusters 128D faces with chinese whispers.

KEN LO said...

Thanks for this great work! I tried other pictures with non-bald faces and found that all of the non-bald faces are assigned to the same person, while the bald faces are correctly assigned to the right people. Is this a training problem? Or should I use a special way to run non-bald faces?

Davis King said...

I don't know why you are having trouble, but there is nothing special about bald faces.

ARBaboon said...

I find tweaking the threshold does wonders. I understand the net was trained to 0.6 but I have better results at 0.45 . This is only an observation.

Davis King said...

Yes, different uses and data sets will probably benefit from adjusting the threshold.
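One simple way to pick a threshold for your own data is to sweep a few candidate values over pairs whose ground truth you already know and keep the one with the best accuracy. The sketch below is a hypothetical illustration in Python: the function names and the toy distances are made up, and each pair is represented as (distance, is_same_person).

```python
def accuracy_at(pairs, threshold):
    """Fraction of labeled pairs classified correctly at a given threshold."""
    correct = sum((distance < threshold) == is_same for distance, is_same in pairs)
    return correct / len(pairs)

def best_threshold(pairs, candidates):
    """Return the candidate threshold with the highest accuracy on `pairs`."""
    return max(candidates, key=lambda t: accuracy_at(pairs, t))

# Made-up distances for illustration only, not measurements from any dataset.
labeled_pairs = [(0.30, True), (0.40, True), (0.50, False), (0.70, False)]
print(best_threshold(labeled_pairs, [0.45, 0.6]))
```

On real data the candidate list would be finer, and the pairs should be held out from anything used for training.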

Gavin said...

Hi Davis,

what preprocessing should I do if I want to finetune or train with my own face data?

1. rotate roll angle
2. rescale to 150*150
3. any other things?

Davis King said...

Yes, only the processing shown in the example program is needed.

bright 910570 said...

Another awesome work of you, thanks a lot!

I'd like to use dnn for detection instead of fhog used in this example, but it seems that the shape predictor can not directly take the result that the net provided as input. How do I convert the net to something can be used in this piece of example?

Davis King said...

The shape predictor just requires a rectangle as input. It doesn't matter where the rectangle comes from so long as it's on the object you want.

Алексей Волков said...

Hello, i have a strange problem compiling dnn_face_recognition_ex.cpp. It just freezes, the CPU usage is max, but nothing happens (waited for 30 minutes). I figured out that it was caused by very long type names, generated by templates. If i use alevel4>>>>> type, it's ok, but alevel3>>>>>> makes a problem. The compiler is supposed to raise a warning 4503 (unless disabled), but not freeze. Tried to install the latest VS 2017 Enterprise, didn't help.
What would you advise to work around the problem?

Davis King said...

This only happens in visual studio since it has terrible C++11 support. You can make it work in visual studio 2015, but visual studio 2017 has even worse C++11 support than 2015 apparently (a lot of users who are trying VC2017 have been complaining to me).

ARBaboon said...

I had to switch to Clang. In cmake you can still have the generator set to VS2017 but set your toolset to v141_clang_c2. Since then I have actually started to use a direct install of LLVM and I use the LLVM-vs2014 toolset (even though I use VS2017). I have altered the dlib cmake files a bunch to tell it not to disable features on MSVC if you are using clang, but I think you can still get what you want with the cmake files that come with dlib.

Stefanelus said...

Davis, is there any way to decrease the sliding window size? Right now it's 80x80, I think.

Davis King said...

Make the input image bigger or train a new detector with a smaller window.

Jay Doyle said...

Hi Davis,

I am using your dnn_metric_learning_on_images_ex.cpp to train on images that are roughly twice as wide as they are tall. I am using the example code (dnn_face_recognition_ex.cpp) to evaluate the trained net. The random_cropper appears to transpose the rows/cols (line 219: get_rect(img)), returning incorrect crops. I swapped the rows/cols and now get better crops, but there still appears to be an error in cropper when handling non-square crops. I can share some images with you if you would like.

I also noticed that the cropper.set_randomly_flip() is set to true, which will feed mirrored faces back to the net. This seems incorrect, but you may have a good reason for doing it.

Thanks,

Jay

Davis King said...

You have to set up the cropping in a way that's appropriate for your problem. There probably isn't "One True Random Cropper" that everyone can always use. Although I'm sure there might be usability improvements to the dlib random cropper object used in that example. But at the end of the day it's up to you to decide how to build the mini-batches.

Đức Lê Huỳnh said...


I have two computers: one Mac and one Windows machine.
When I use dlib's DNN to compute the embedding vector, it takes 0.05s per image on the Mac but 0.3s on Windows. Why is there such a big difference?
(The two machines are on par with each other in terms of specs.)
My Mac config: Core i5, CPU 2.6 GHz, RAM 8 GB,
and my Windows config: Core i7, CPU 2.4 GHz, RAM 32 GB

Thanks,

Duc

Davis King said...

http://dlib.net/faq.html#Whyisdlibslow

gaurav gupta said...

Why does the face recognition model take landmarks (shape) as input in the Python example? How is it used to compute the 128D face descriptor?
face_descriptor = facerec.compute_face_descriptor(img, shape)

Duc Le Huynh said...

i use QT Creator, i don't use Visual Studio
http://dlib.net/faq.html#Whyisdlibslow only support for Visual Studio

Davis King said...

Maybe you have a BLAS library on one machine and not on other then. I don't know.

Steven said...

Hello Davis,



I have been asked to download the dlib library on my windows system. I have followed the instructions given here: http://www.paulvangent.com/2016/08/05/emotion-recognition-using-facial-landmarks/

I have already downloaded cmake and Visual Studio. On the dlib folder command prompt, on running "python setup.py install" I get the problem that you can see in the picture added.

http://imgur.com/QJZyh0c

I have Python 3.6 with conda running on Windows 7 64-bit. The boost and cmake steps went OK, and I also installed Visual Studio, but the last part (python setup.py install) does not work.

Thanks in advance

Davis King said...

You didn't install the C++ part of visual studio. You have to select C++ when you install it or it's just going to install other visual studio stuff.

gaurav gupta said...

How is dlib's face descriptor embedding different from FaceNet's embeddings?
https://arxiv.org/pdf/1503.03832.pdf
https://github.com/davidsandberg/facenet

sutony said...

Hi Davis King,
From your response to Kyle McDonald's comment, I learned more about how you clean the dataset. Thanks for sharing your experience.

However, I still don't understand how you used the graph clustering method. Is it used for either one of the purposes below?
1) automatically merging same identities from different datasets(vgg/facescrub)?
Or
2) clustering similar faces within an identity's folder so that you can more easily pick out the outliers manually.

Davis King said...

Both. It's still a very manual process. You have to do a lot of review to make sure the labeling is going to be improved. These automated tricks are just to help you review the data and find labeling mistakes. They aren't going to create a cleaned dataset for you.

Danish Nazir said...

Hi Davis, first of all, great work (y). I just wanted to ask whether there is a Python implementation of the recognition model you described above. I found a recognition example, http://dlib.net/face_recognition.py.html, but it is very limited compared to your C++ example, so I was hoping you could provide a Python example that draws the histogram and then applies the Chinese whispers algorithm. It would be a great help if you could do that.
Regards,

Gavin said...

Hi Davis,

I'm curious why you use a Euclidean metric for the loss. I see a lot of other frameworks using cosine similarity to test the similarity between faces. Is it by design?

Davis King said...

I found that this works better.

AlexAnd said...

Hello Davis,
Could you explain, please, the threshold value (0.6) - is it there by design? Can it be set lower/higher and by how much? In 128D even the slightest increase/decrease of it should mean a HUGE difference in volume, am I right?

Davis King said...

You can set it to whatever you want. See what happens. Maybe it works for your problem or maybe it doesn't.

AlexAnd said...

Thank you for your answer. Just one more question - only RGB images should be used for this particular pretrained DNN as an input? Could I use grayscale ones instead? Dlib's face detection eats them perfectly, but here I see no options. Though I'm quite a nub in face recognition, but for me it seems obvious, that such unreliable thing as color information (different light conditions, changed skin tint/make up, etc.) should offer not much of a real help for the process, am I right? Anyway - is there any reason for me to try to define something like input_grayscale_image, so the data could be transferred into tensors in a proper way?

Davis King said...

You don't have to do anything. The existing code will load a jpeg or png or whatever and process it just the same regardless of it being color or gray. As for how well it will work without color, I have no idea. It's probably alright, but maybe not quite as good.

AlexAnd said...

Thank you once again. Actually, I just try to deal with this DNN directly from my application, where I produce only grayscale images. Last question: what R,G,B average values (122.782, 117.001, 104.298) are there for? I cloned your input_rgb_image_sized to the new input type (all the same, only luminance will go to R,G,B), but have no idea about these offsets. So, in order to achieve best results, should it be put there in the same asymmetric way, or just be in [-0.5, +0.5] bounds, simply copied for all color channels?

Dreamer said...

Hi, in dlib I find many examples that can differentiate faces. Can it differentiate different objects like bottles or bags based on their color? And which algorithm do you think can help? Any link or suggestion will be very helpful. Thanks in advance.

Davis King said...

If you get a big training dataset I'm sure you could make something that does that. There are links to the example programs that show how to train new models referenced from this post. In particular, read the C++ face recognition examples.

Antonio Jesús said...

Hi Davis,

I'm using openface: http://cmusatyalab.github.io/openface/demo-3-classifier/ for training a classifier. I have 1.500 people with around 50 images per person which makes 80k images and generate a huge classifier that take ~35 seconds to predict a person.

Now I want to scale that up, but if I increase the number of people to 10k it will take forever. My current machine is what I have; will your approach have better performance? I mean in time, not in accuracy. Or could you give me some advice about what to improve/change/use?

Thanks

Davis King said...

You aren't going to make a very good model with a dataset that small. Fortunately, you can use the free model that comes with dlib. It is trained on millions of faces and gets state-of-the-art accuracy on the standard LFW benchmark for face recognition.

Antonio Jesús said...

I think I didn't explain myself properly. I don't want to generate a new model. From what I got, in your post, you have a clustered image of different actors and it can cluster them by actor. I guess the name you just put it in the picture. But I have right now 80.000 images and I'm planning to expand it to 800.000 if I keep having 50 images per person. How can I use a new picture that isn't on this model and make a prediction of who has more similarities with my new image as openface is doing

QA Collective said...

Just briefly, congratulations on the dlib library, its fantastic and I'm only just getting to know it - as it's helping both speed up my code and make it work with higher accuracy!

I have a question regarding face descriptors. I am tracking unique faces seen in video clips in a real-time application. So with a stream of frames containing faces, I already compute the euclidean distance between faces to ensure I haven't seen a new one. If I do see a new face, I collate descriptors into an inventory for that face.

I'm wondering, for the inventory (cluster) of descriptors that I've already gathered for each 'unique face', would there be any benefit in computing an 'averaged descriptor' for each person - say over 50 frames?

My thinking is that this might help identify that face more accurately (or in 'fringe cases'), because it could account for a moving mouth (while speaking) and blinking eyes and angles of the face nodding a head etc

I only used the phrase 'averaged descriptor' in an abstract sense as I have no idea mathematically how best I would do this if it were deemed a good idea - would I literally do an index-wise average over 50 vectors to produce a single vector?

Andrew

Davis King said...

Thanks, I'm glad you like dlib :)

Trying to make an average isn't usually going to work very well, due to the unintuitive geometry of high dimensional spaces. To be specific, suppose you have two sets of points in 128 dimensions, call them A and B, such that all the points in A are within 0.6 distance of each other, and similarly all the points in B are within 0.6 distance of each other. Moreover, suppose that none of the points in A are within 0.6 distance of any point in B.

It is surprising, but true, that it's quite likely that the distance between the centroid (i.e. the average of all the points) of A and the centroid of B is less than 0.6 apart. This kind of thing can happen in low dimensions as well, but it becomes increasingly more likely to happen when the dimension goes up.
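A constructed example shows that this situation is at least possible. The Python sketch below builds two sets of 128D points that satisfy all the conditions above and then measures the distance between their centroids. The construction (each point offset along its own axis, the two sets separated along axis 0) and the constants are choices made for this illustration, not something taken from real face descriptors, so it says nothing about how often this happens in practice.

```python
import math

DIM = 128    # descriptor size used by the model
N = 50       # points per set (toy value)
R = 0.4      # each point's offset along its own private axis (toy value)
DELTA = 0.3  # separation between the two sets along axis 0 (toy value)

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def make_point(axis, offset_along_axis0):
    """A point that is zero everywhere except axis 0 and one private axis."""
    v = [0.0] * DIM
    v[0] = offset_along_axis0
    v[axis] = R
    return v

set_a = [make_point(1 + i, 0.0) for i in range(N)]        # private axes 1..50
set_b = [make_point(1 + N + i, DELTA) for i in range(N)]  # private axes 51..100

max_within_a = max(euclidean(p, q) for p in set_a for q in set_a)
max_within_b = max(euclidean(p, q) for p in set_b for q in set_b)
min_across = min(euclidean(p, q) for p in set_a for q in set_b)
centroid_gap = euclidean(centroid(set_a), centroid(set_b))

print(max_within_a, max_within_b, min_across, centroid_gap)
```

Working it out by hand, every within-set distance is R·√2 (roughly 0.57), every cross-set distance is √(DELTA² + 2R²) (roughly 0.64), and the centroids are only √(DELTA² + 2R²/N) (roughly 0.31) apart, because averaging shrinks each per-point offset by a factor of N.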

So using an average is generally not a good idea. You should instead use a k-nearest-neighbor type of algorithm to do classification.
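A k-nearest-neighbor classifier over stored descriptors might look like the following Python sketch. The function names, the choice of k=3, and the rule of ignoring neighbors beyond the 0.6 threshold are assumptions made for illustration; the toy 2D vectors stand in for real 128D descriptors.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_identify(descriptor, labeled, k=3, threshold=0.6):
    """Majority vote among the k nearest stored descriptors.

    `labeled` is a list of (name, descriptor) pairs. Neighbors farther
    away than `threshold` do not vote; if nobody votes, return None.
    """
    nearest = sorted((euclidean(descriptor, d), name) for name, d in labeled)[:k]
    votes = Counter(name for distance, name in nearest if distance < threshold)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical inventory: two descriptors each for two people.
inventory = [
    ("person_a", (0.0, 0.0)), ("person_a", (0.1, 0.0)),
    ("person_b", (1.0, 1.0)), ("person_b", (1.1, 1.0)),
]
print(knn_identify((0.05, 0.0), inventory))
print(knn_identify((5.0, 5.0), inventory))
```

Every stored descriptor is kept rather than being collapsed into one average per person, which is the point of the advice above.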

QA Collective said...

Did you mean to say 'greater than 0.6' apart in your sentence above? That would seem to make more sense in the context of what you were saying.

In this case, I will begin to teach myself about k-nearest-neighbor algorithms, as I've been hearing a lot about them in my reading of late :)

Davis King said...

No. I meant less than. It is quite counterintuitive :)

QA Collective said...

haha right you are then on the counterintuitiveness! Although I do now understand what you meant - I hadn't read your suppositions in the previous paragraph carefully enough. My apologies.

Davis King said...

No worries :)

Antonio Jesús said...

sorry Davis to insist but do you have any hint?

John Mac said...


Very simple question on the use of dlib in this example:
I am interested in comparing an unknown face one by one with a large number of known faces.
I can see in the example code that:
face_descriptor = facerec.compute_face_descriptor(img, shape)
will give me a 128D vector for my unknown face.

If I have a database of all the 128D vectors for all the known faces, how can I compare
two 128D vectors to get the distance between them (i.e., the similarity of the faces)?

Davis King said...

Use a for loop? There are a lot of ways to code it.
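
One of those ways, as a minimal sketch in plain Python: a loop that computes the Euclidean distance between two descriptors, and a second loop that scans a database for the closest known face. The function names and the dictionary layout are hypothetical, made up for illustration; the 0.6 threshold is the value discussed in this post. Descriptors are assumed to be plain sequences of floats (for example, `list(face_descriptor)`).

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two descriptors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
    return math.sqrt(total)

def closest_match(unknown, known, threshold=0.6):
    """Scan `known` (a dict of name -> descriptor) for the nearest descriptor.

    Returns (name, distance) if the nearest one is closer than `threshold`,
    otherwise (None, distance of the nearest one).
    """
    best_name, best_dist = None, float("inf")
    for name, descriptor in known.items():
        d = euclidean_distance(unknown, descriptor)
        if d < best_dist:
            best_name, best_dist = name, d
    if best_dist < threshold:
        return best_name, best_dist
    return None, best_dist
```

A linear scan like this is fine for a few thousand faces; for a much larger database you would probably want a vectorized or indexed search instead.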

lbouza said...

Hi Davis, very nice work with dlib! I'm a PhD student working in Face Recognition and I have used dlib a lot for face detection, landmark localization, tracking, etc. with remarkable results. Now, I'm trying to replicate your results following the LFW protocol. Doing so, one first question arises: which images did you use? As you know, there are different sets of the LFW database according to the alignment method used, i.e., the original aligned, funneled, deep funneled or lfw-a images. Or did you perform a different alignment/preprocessing on the images?

Davis King said...

I used the regular old unaligned images in LFW and ran them through the alignment procedure you see in the dlib example programs.

Jon Hauris said...

Fantastic stuff. Thanks for all you've done!
I am doing face detection / recognition on IR images. This means I cannot use the standard features for detection or recognition. I am trying to build my own detector using your "train_object_detector.py" and it is working really well - mostly.
I have a training set that are faces of one size and the detector is getting faces of similar sizes but completely missing smaller face images.

So my question is how does the system work to detect faces of different sizes. Do you need to have training samples of all the sizes that you expect to be finding? Or does the system take in the training images and resize them?

If you could clarify how this process works and what kind of training set I need and how it works to find faces of different sizes, I would really appreciate it. I have the recognizer working well, I just need to find the faces.

Thank you, Jon Hauris

Jon Hauris said...

I guess what I am asking is how does the stride and pyramid layering work and how do we control it. I am using the python interface.
Thanks

Davis King said...

You can't control it through python. But more importantly, you should read about image pyramids if you want to understand what it's doing. Wikipedia explains them well. And more to your question, the detector finds all objects in an image bigger than the "detection window" which is user specified. If you want to find smaller objects either train with a smaller window or resize your images so they are bigger. Usually resizing is the best strategy.

Jon Hauris said...

thank you, that is exactly what I needed to know.

Jon Hauris said...

I would really appreciate it if you could also tell me what this parameter does and how to interpret/adjust it: options.upsample_limit = 2;
What happens if I increase/decrease this value?
Thank you, Jon

Davis King said...

It's described in the documentation: http://dlib.net/python/index.html

Stefanelus said...

hey Davis,

the face recognition tool is using the DNN distance metric, and I was trying to use it on a GPU to train with some data.
The GPU utilisation is under 7%, after a while it drops to 0, and the whole process is very slow. Are there things that I can do
to make the whole thing faster?

Stefan

Davis King said...

Run multiple images at a time by putting them into a batch.
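
As a rough sketch of the bookkeeping side of that advice, the helper below splits a list of inputs into fixed-size batches. The function name is hypothetical, and how each batch is then handed to the network depends on the API you are calling, so that part is deliberately left out.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded slice would then go through the network in a single call, rather than looping over the images one at a time.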

Shubham Juneja said...

Hi. Great stuff, thanks for this. I have a question, I have been using the code in the face_recognition.py for trying out the LFW protocol myself. As it says I should get 99.13 without the third parameter, while I only get 99.02 somehow. Could it be that the file face_recognition.py is missing something that you used while testing ?

Also the default dlib face detector sometimes misses out some face detections while testing over LFW.

Davis King said...

LFW isn't about testing detection, just recognition. So you have to measure the accuracy of only the recognition component. This means you don't throw away any images in the LFW set when you do the evaluation.

Sliver21 said...

Completely true, I agree with you. Hence I went for a state of the art detector. Yet the accuracy with specified jitter/resampling params reaches 0.11% less than each of the 3 mentioned ones.

Davis King said...

Use dlib's detector since that's what the whole pipeline is trained with, but for faces it doesn't find, supply boxes that are similar to what the dlib detector would have provided.

Shubham Juneja said...

Thank you for your reply :)

Jon Hauris said...

I have trained my own IR face detector using train_object_detector.py
I have 2 sizes of faces:
a.) 362x292 face and
b.) 108x82 face inside a 362x292 image
I set options.upsample_limit = 4; and options.detection_window_size = 80*80;
And I trained only on the a.) faces
When I run the detector it finds the a=362x292 sized faces but not the b=108x82.
-----
When I manually resize the b. by 2x, i.e. 724x584, it finds the resized 108x82 faces.
-----
Shouldn't the upsample_limit be doing this 2x resizing and finding the b. images?
Otherwise it would be upsampling the 724x584 by 2, 3, &, 4 and this would overlap the
2, 3, 4, upsampling of the 362x292 images
4x(362x292) = 2x(724x584)
So, if the upsampling was working it should get the b. images.
Thanks, Jon

Davis King said...

upsample_limit applies only during training. When you use the detector it's up to you to prepare your image by upsampling, downsampling, cropping, or whatever else you think is appropriate before you run the detector.

Jon Hauris said...

OK, THANKS

Eslam Snono said...

hi,
I have my own database of 3 persons and want to use them as a reference to check whether they appear in a picture. I tried to train a model using dnn_metric_learning_on_images_ex, but each time it gives me an error in dlib/dnn/loss.h. So can you help me use my database for face recognition here?
thanks in advance

Davis King said...

You don't need to train anything. You should use the trained model file mentioned in this blog post.

Bill Klein said...

Great work here. I notice that detection/normalization/description works very quickly (~50ms), however, jitter_image() is a major bottleneck to say the least (> 2s in my tests).

I'm wondering if anyone can suggest a less-than-perfect, but better-than-nothing jitter technique that won't add more than a few ms of time...?

bkj said...

Is there a script available anywhere for reproducing the LFW results?

Specifically, I see above you run the face alignment algorithm, but do you also run the face detector? Or do you just run the aligner on the raw image?

Davis King said...

Edit jitter_image so it does fewer rounds of jittering. There is a for loop inside it, just change it to run fewer times.


As for the LFW test script, the entire program that runs the LFW test can be found here: http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2

Bill Klein said...

Thanks for the great support! One other question: In the example, is the use of net (anet_type) thread safe?

I know the general rule for dlib, but after looking through the code/docs I couldn't figure out whether the use of net is truly non-modifying, and therefore thread safe without sync...

Davis King said...

If you look at the documentation you will see that running anything through a net modifies the net in a variety of ways, so no, it's not thread safe.

Bill Klein said...

Thanks again for the help. However, what documentation are you referring to? I've read the docs for loss_metric and many others, but haven't seen explicit mention of that... Perhaps that's because a number of classes are involved and I'm not looking at the right one. I will continue reading up.

Also, am I right in understanding that this all happens in a single thread? Looking at the docs/headers I'm trying to figure out if it's already parallelized, but can't figure it out.

BTW, even aside from your great support, I am surprised at how easy it was to get everything working and how well it works. Amazing stuff.

Davis King said...

Practically all of the documentation. For instance, didn't you notice you can call get_output() on different layers on the net to see the output of that layer? How could that work if the output wasn't stored in the net?

KEN LO said...

Hi Davis,

Is it patent-free when i use this deep learning face recognition algorithm or other dlib algorithms for commercial usage ?

Davis King said...

I don't know of any patents that cover algorithms in dlib.

Ola Glasmann said...

I have been toying around with converting the model to Tensorflow, which I'm more familiar with than C++. Is there any preprocessing on the input images before they are fed to the network? It looks like the class input_rgb_image_sized does some subtraction of RGB-mean values and divides by 256, is that also performed on the input faces for this network?

Also, I see the face landmarks are passed to the facerec model. Is this to do some fancy face alignment before feeding to the network?

Great work on this, and dlib in general!

Davis King said...

The python code is somewhat opaque. However, if you look at the C++ example it's all laid out for you to see: http://dlib.net/dnn_face_recognition_ex.cpp.html

Ola Glasmann said...

Thanks for the swift reply! I'm not fluent in C++ so this might be a stupid question, but I see input_rgb_image_sized<150> is the first "layer" of the network. Does this mean it automatically does the mean subtraction on input images (based on the to_tensor function in class input_rgb_image_sized from dlib/dnn/input.h).

I'm asking because I get different results on the same image in my Tensorflow version of the network and the dlib implementation, and I'm wondering whether it's some bug of my own making or simply differences in preprocessing.

Davis King said...

Yes, it subtracts and scales the input pixels.

Giorgos B. said...

Hello Davis,
I admire your work very much, and wish to congratulate you for that! I'm willing to build a gender recognition algorithm with age classes (baby, child, teen, young adult, adult, senior), with the help of dlib. What are your thoughts about that? How can I train a classifier (like an SVM) to do that? I've already found about 3000 images of people and spread them across the above classes.
Thanks a lot,
George!

Davis King said...

Thanks, I'm glad you like dlib. There are lots of examples in the examples folder that show how to do classification. Pick one and start from there.

pravallika kollipara said...

hey davis,

I tried using dlib for face recognition. I tried it with both C++ and python, but the problem is that I am getting a different vector as the face descriptor in each... do you know any reason why this is happening?

Davis King said...

Given the same inputs you should get the same outputs.

pravallika kollipara said...

ya.. my bad, it gives the same output, but the problem is that in C++ it gives float format and in python it gives long double format, and this leads to approximation in C++. I tried to define the matrix as long double in C++ but it gives an error that net(faces) is of matrix type and cant be converted to matrix
Can you suggest what is to be done? I am a novice in this so dont mind all the beginner questions please... :)


thanks!!

Davis King said...

Don't worry about it, float vs double doesn't matter here.

Giorgos B. said...

Dear Davis,
On the process of building a classification program, i am trying to use FHOG (extract HOG features from images) and feed those features on a one_vs_one trainer. But i can see in some of your examples that it takes as input matrix samples, while FHOG gives array2d >! Should i try to convert those, or should i use a different trainer?
I'm a little confused because there are so many examples about classification...so maybe i should try another (easier) project instead!
Thanks!!

Giorgos B. said...

I meant the trainer takes "matrix < double,31,1 > " as input, while the FHOG outputs "array2d< matrix < float,31,1 > >" ! Html tags messed everything up!

Alexander L said...

Hi,

I've tried face recognition by dlib and it's really fascinating!

But it's very sad to see that the software has a huge racial bias (like the one Google has had) - it can differentiate "white people" well, but it does not differentiate "black people", so it sorts all "black men" together into one group and all "black women" together (with one mismatch where a woman is sorted with the men). This scenario was not "specially constructed", it was simply a first try at testing the algorithm on a "wild scenario". The image I used was from "Heart to Heart International: Our People of the Year - Ebola fighters" (www.hearttoheart.org/our-people-of-the-year/ - big poster on top of the site). Is it possible to avoid this?

And I have three more interesting questions, as follows:
1. Assume the DNN is loaded with "pretrained" weights.
2. After that it will recognize/compare the face of a new (unknown/unseen) person with some probability Pn. For recognition I will compare some number (ideally one) of nearly identical pictures Px (from a video stream) of the person with one or more "template" picture(s) Pt of the same person (which in turn are also nearly identical, and ideally one picture). But Px is not necessarily "nearly identical" to Pt.

Question nr. 1 - is it better (for recognizing a person as a specific known person) to compare with one template picture or with several (nearly identical) pictures?

Question nr. 2 is more complicated. If it is also known that some pictures Pt (nearly identical to each other) are all of the same person, is it possible and reasonable, and if so how, to "continue training" the DNN weights so that the new recognition rate would be Pn+1 > Pn?

Question nr. 3 is a really experimental one - has somebody tried to combine this DNN method with the Eigenfaces/Fisherfaces method so that the DNN recognizes using "back projected" faces? This of course assumes that many more (50 or more) variations of the unknown "template" person's face were recorded. Or does it lower the recognition rate and raise the false-positive rate?

Thank you for considering my questions.

Cornelius Grimm said...

Hello Davis,

first of all, thank you very much for your efforts around the dlib library and making it open source! I really appreciate that!

I have a question regarding the training data of the ResNet model: Was the feret or colorferet database used for training? Or would it be valid to evaluate with the feret/colorferet face database?

Best regards
Cornelius

Davis King said...

I didn't use those datasets. They are very small and not really suitable for this kind of task. You could certainly evaluate on them if you wanted to though. You could evaluate on anything :)

The most common benchmark for this kind of thing, however, is the labeled faces in the wild benchmark.

saturnkayin said...

Hi Davis,
Thanks a lot for the DLIB library.

My question is about using a unique face descriptor from several images of the same person. In a previous post you said it's not a good idea to average all the descriptors. However, as far as I understand, you do exactly this when using jittering (matrix v1 = mean(mat(net(crops)));).

Can you tell me which would be the best option? How about computing all distances between each other and taking the mean distance?

Thanks a lot in advance.

Davis King said...

Use some classifier, like kNN or linear SVM for this.
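
For the kNN option, a minimal sketch in plain Python might look like the following. The function name, the data layout (a list of `(label, descriptor)` pairs), and the default `k=3` are assumptions made for illustration; the 0.6 cutoff is the distance threshold discussed in this post. It keeps every stored descriptor for a person instead of collapsing them into one average.

```python
import math
from collections import Counter

def knn_identify(query, labeled, k=3, max_distance=0.6):
    """Classify `query` by majority vote among its k nearest labeled descriptors.

    `labeled` is a list of (label, descriptor) pairs. Neighbors farther away
    than `max_distance` do not vote; if nobody votes, None is returned.
    """
    scored = []
    for label, descriptor in labeled:
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(query, descriptor)))
        scored.append((d, label))
    scored.sort()  # nearest first

    voters = [label for d, label in scored[:k] if d <= max_distance]
    if not voters:
        return None  # treat as an unknown person
    return Counter(voters).most_common(1)[0][0]
```

Returning None for a query with no close neighbors gives you a natural place to register a new identity, which is what a tracking application with previously unseen people needs.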

Alexander L said...

Hi Davis,

many thanks for a great DLib library!
It is possible for you to answer some of my questions above (Jul 3)?

Many thanks,
Alex

Lubyagov Nickolay said...

Hello!
Thank you very much for your wonderful library, it's a great job !!

I probably have a similar question to several previous ones. I hope I will not be too brazen.

I'm looking for faces with video, that is, I can get many vectors of the same person (by tracking the face with the help of correlation_tracker). My task is to determine how many different people passed the camera. If a person left the frame and lost the track, and then came in, I must count it 1 time.

Often the photos are of poor quality (I try to find the best photos using the definition of the head tilt and blur, but this does not give good results), besides they are made from a great distance.

Now I take one photo and compare its vector with the others, one saved for each person. Then I look for the minimum value of the Euclidean distance. If none is less than 0.6, I add this vector to the saved ones. It works badly because of the poor quality of the photos, especially if bad photos are used as the originals that other photos are compared against. They give false positive results with others that are not even similar.

I looked at the K-Nearest Neighbors algorithm, it requires a trained classifier model. But, initially, I do not have a model. I do not know who these people are.

I think: If I have, for example, 10 photos, for the first person, I will add them as a separate class, then, for the second person, I also have 10 vectors. I can compare them each vector of the second person, with the model, and I will get 10 results (some may belong to the first class, but some are not). Further, for some algorithm (I do not know what to apply for this), I have to add (or not add depending on the result), a new class to the model (class chosen from 10 vectors of the second person). I think correctly, or can I use some other, classifier / clusterizer here?

Now I understand how to use K-Nearest Neighbors, if I had 10 photos in advance, for each person, and the vectors stored in the model were calculated for them, at the same time I compared one photo to the model. But in my case the situation is different.

Many thanks!

Alexander L said...

Hello Davis!

I've done some experiments on face recognition and it is really fun & a joy :)
One example was really amazing - I have two men with three different photos of each; the two men also look very different, and moreover one of them wears glasses in all three photos. For some reason both persons, i.e. all 6 photos, were grouped as the same person. I simply don't understand why. Would you be interested in seeing this result (I can't make it public)?

You also wrote that the total number of individual identities in the dataset is 7485 and that the whole dataset is a mix of different datasets and other people from the internet. Can you please explain which datasets were used?

I've also asked earlier whether it is possible to continue training a network as new data arrives, without retraining the whole network from the beginning. If that were possible, it would be a great addition, because the network could get better continuously over time without taking too much time for retraining.

I also have another very interesting question - your network computes a 128D vector from a face, and 128D is a widely used dimension for such a task. Why 128? Can we get better results with a higher dimension, up to 256, and how much might the training/prediction time grow?

And the other question: why was 0.6 used as the distance threshold for training/prediction? What happens (and how can we do it without your unpublished database) if we want to make it smaller, say 0.1?

Many thanks!

Davis King said...

Many of the magic numbers are arbitrary. Other issues are research questions and you can find out a lot by reading the literature or doing experiments yourself to see what happens.

You can fine tune the networks as you can with any other deep network. The dlib API is documented in detail on the web page.

Giorgos B. said...

Hello again Davis,
One more question, is it possible to feed Imglab with rectangles of fixed size? I mean, instead of dragging to set the rectangle's size, just choose a point which will denote the rectangles center, and the rectangle's size is fixed (NxN). I'm asking this, because i'm finding it hard to annotate rectangles of same aspect ratio, since the objects i'm trying to detect can be rotated...
Thanks a lot!

Lubyagov Nickolay said...

Hello Davis.
I find many pairs of people who have a Euclidean distance of less than 0.6, yet they are different people. And such cases are much more than 0.62%. When working with video, from a few frames, I also have this situation for all frames in most cases.

In these images, the distance is about 0.53.
https://drive.google.com/open?id=0B3_jMty60ScdXy1aVGt4UlEyRjQ
https://drive.google.com/open?id=0B3_jMty60ScdM3VZZGRqaUIzNFk

If I use a value of 0.53, it looks better...

(For video I try to take 6 photos, find the distances from all the old vectors to all the new ones (to be inserted), then select the 24-36 smallest distances (if there are any) and try to find the dominant class. If this class has more than 70% of the combinations, I treat it as the same person. Otherwise I add all 6 new vectors as a new class. It helps a little. Mostly the error goes through all the photos....)

Davis King said...

You can't fix the size of rectangles in imglab. I would recommend drawing accurate bounding boxes in the tool. Then if you want to force them to all have some property, like a certain aspect ratio or some other thing then write a script that reads in the xml file and applies whatever kind of transformation you want. It's easy to do. The routines dlib uses to read and write those xml files are part of dlib's documented API: http://dlib.net/dlib/data_io/image_dataset_metadata.h.html
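
A minimal sketch of such a script follows. Note the swap: instead of dlib's own image_dataset_metadata routines it uses Python's standard xml.etree module, and it assumes the imglab file stores each rectangle as a `<box>` element with integer `top`, `left`, `width`, and `height` attributes. The function name and the choice of transformation (growing each box into a square around its center) are illustrative assumptions; substitute whatever property you need.

```python
import xml.etree.ElementTree as ET

def square_boxes(xml_text):
    """Return the dataset XML with every <box> grown into a square.

    Each box keeps its center; its shorter side is extended to match the
    longer one. Boxes near the image border may end up with negative
    coordinates, which this sketch does not clip.
    """
    root = ET.fromstring(xml_text)
    for box in root.iter("box"):
        top = int(box.get("top"))
        left = int(box.get("left"))
        width = int(box.get("width"))
        height = int(box.get("height"))

        side = max(width, height)
        # Shift the origin so the enlarged box stays centered on the old one.
        new_left = left + (width - side) // 2
        new_top = top + (height - side) // 2

        box.set("top", str(new_top))
        box.set("left", str(new_left))
        box.set("width", str(side))
        box.set("height", str(side))
    return ET.tostring(root, encoding="unicode")
```

To use it on a file, read the XML into a string, pass it through `square_boxes`, and write the result to a new file so the original annotations are preserved.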

Lubyagov Nickolay said...

https://drive.google.com/open?id=0B3_jMty60ScdLV8tWHVnUWoxdXc
https://drive.google.com/open?id=0B3_jMty60ScdaFdMNG5zMDBVVmc
https://drive.google.com/open?id=0B3_jMty60ScdOWhnZ1JYUkVrbXM
Same situation

Alexander L said...

Hello Davis & Nickolay,

in the above examples the girls have closed eyes and open mouths, and the boys have the same angle of view on their faces - maybe this breaks detection?

I have the same situation but with one BIG difference - the people look absolutely different, but are grouped into the same cluster. How is this possible?

Lubyagov Nickolay said...

Alexander, it does not seem that the mouth and eyes affect the result. In general, as far as I can see, the results are quite accurate regardless of the facial expressions, which is surprising to me. I used 2 frames, with eyes open and the mouth in different positions, and they too were grouped into one cluster. https://drive.google.com/open?id=0B3_jMty60ScdTlVXeDkycTVXTlE
Maybe this is the fault of makeup...

Lubyagov Nickolay said...

But basically, all the photos I tried have a good result, with a value of 0.53 (If the value is greater, I have errors).

Davis King said...

There is a certain error rate, especially in the clustering algorithm used in the example program. Sometimes you might have to adjust the threshold or do some other thing. The point of the example program we are talking about here is to educate and be an introductory document that helps the reader begin to understand and work with face recognition systems. It's not a turn-key problem solver that you just run and it does all things for all applications. It's just a jumping off point into the deeper world of face recognition.

Bill Klein said...

I don't think that this is just an issue with the clustering algorithm, but of the false-positive rate when comparing any two given faces. I've noticed in my tests that there is some fraction of faces of different people that will compare with a distance lower than 0.6.

"given two face images, it correctly predicts if the images are of the same person 99.38% of the time."

I think this is the trip-up. I think 99+% is optimistic, but I haven't done systematic testing to get the error rate that I'm seeing (yet)...

Jon Hauris said...

How do we know what the various "face_types" are that are returned from detector.
For example, for:

the FACE_DETECTOR.PY example says:
# Also, the idx tells you which of the face sub-detectors matched. This can be
# used to broadly identify faces in different orientations.

I ran the example and it returned:
7 bounding boxes, scores, & and face_type:1.0, and
1 bounding box, score, & face_type:4.0, and
2 bounding box, score, & face_type:2.0

What do these face_types mean and is there a description of all the potential types etc.

Jon Hauris said...

In the face_detector.py example there is the line:

dets, scores, idx = detector.run(img, 1, -1)

1. Is there more detailed documentation (or can you explain here) what the 3rd argument means, i.e. the -1. Specifically how it affects the output, & how do I know how to set it.
2. Also, what does the idx value mean. --> face_type:x.x
Is there a description of what each face_type value, means.
3. How is score calculated? Is there a reference for this?
4. Does the 2nd argument (1 above) upsample the img if set > 1?

The code is well written and documented, I just need more information about these values.
Thank you, Jon

Davis King said...

There are 5 sub-detectors in the default face detector model, each matching a different face orientation. If you run it on some images and plot the outputs you will see the variation in detection pattern as a function of sub-detector.

As for questions about accuracy, the pretrained model gets 99.38% accuracy on the LFW benchmark. It should be noted that LFW is heavily biased towards white adult american public figures. The model was also trained with a large dataset with a similar bias, so this creates an obvious problem for images of people that don't naturally match that distribution. It should still work well in many cases, but likely would require a different threshold. It should also be noted that usually when you make a detector for a specific person you will train something like a linear SVM on the 128D face vectors. The example program in this blog post uses something more akin to a k-nearest-neighbor method because it makes for a fun example, not because it's the best thing to do in all cases.

Moreover, even in the good case of data like LFW, a 99.38% accuracy doesn't mean there won't be mistakes. For instance, if you have 30 images and you want to compare them all to each other, you have to make 435 comparisons because that's how many pairs there are if you have 30 things. Therefore, if you treat the comparisons as independent, 1-0.9938^435 is the probability that at least one of those comparisons makes a mistake. This is about 93.3%, so very likely.
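
The arithmetic above is easy to reproduce. This small sketch assumes, as the calculation itself does, that each pairwise comparison is an independent trial with the LFW accuracy; the variable names are only for illustration.

```python
n_images = 30
accuracy = 0.9938  # per-comparison accuracy on LFW, from the post

# Number of unordered pairs among n_images items: n * (n - 1) / 2.
n_pairs = n_images * (n_images - 1) // 2

# Probability that every comparison is right, then its complement.
p_all_correct = accuracy ** n_pairs
p_at_least_one_error = 1 - p_all_correct

print("pairs to compare:", n_pairs)
print("P(at least one mistake):", p_at_least_one_error)
```

Changing `n_images` shows how quickly this grows: the number of pairs is quadratic in the number of images, so even a small per-pair error rate compounds fast.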

So you have to think carefully about how to use a system like this if you want to get good results. Be aware of the details.

Alexander L said...

Hello Davis,

as you can see, your work is of huge interest to the community :)
Because of that, more questions will keep coming, but we can't really do the research and answer these questions ourselves, because we lack a training database. You have already invested a huge amount of time in creating that database. Maybe this is the time for us all to cooperate, since many additional databases have been published, but we don't know (I have already asked this) whether some of those are already integrated in your training database.

At the beginning of this blog you wrote "The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6."

Then, ARBaboon wrote: "I find tweaking the threshold does wonders. I understand the net was trained to 0.6 but I have better results at 0.45 . This is only an observation."

And lastly, Lubyagov Nickolay said "But basically, all the photos I tried have a good result, with a value of 0.53 (If the value is greater, I have errors)."

At the same time you wrote "Training took about a day on an older titan x."

So my question is whether it makes sense, and whether it is possible for you, to just provide us with trained network weights for the additional projection radii of 0.45 and 0.53? And maybe for a higher dimensional output than 128D?

And of course it would be great if we could integrate our experience into the already trained network, for example by providing "negative" examples where the classification output fails. Is it possible at all to supply "negative matches" to the training algorithm in order to improve the robustness of the result?

Davis King said...

The Microsoft celeb-1M database is available online and is larger than the dataset I used for training. So nothing is stopping anyone from doing experiments themselves. There is a whole example program that explains how to train new models in the examples folder.

Also, I'm too busy doing other projects to retrain the network. And retraining with different thresholds is not an interesting experiment in any case.

Jon Hauris said...

Davis, in regards to the face_detector.py example, are you saying that dlib.get_frontal_face_detector was not trained with an SVM classifier but rather a K-NN type classifier?

Also, is there a listing of the dlib.get_frontal_face_detector code and of class dlib.fhog_object_detector? I cannot find it in your github. I did find this at: http://dlib.net/python/index.html#dlib.get_frontal_face_detector
But all that says is:

dlib.get_frontal_face_detector() → fhog_object_detector
Returns the default face detector
And does not tell me much about how it works. Same for class dlib.fhog_object_detector.
Thanks, Jon

Davis King said...

I wasn't talking about the detector. That is trained with an SVM.

http://dlib.net/python/index.html has documentation for fhog_object_detector. It lists the methods available and what they do. The code for all this is on github, the python bindings are in the tools/python folder.

Ran Vardimon said...

First of all - thanks for sharing this really great work!
I'm using another face detector before using your Face Recognition API. However, some of the detected faces are not found by the dlib detector. Therefore, I've tried using the dlib CNN directly on all crops, without the 68-point shape alignment (I made sure to rescale them to 150x150). From the tests I've made, it seems like the recognition accuracy drops very strongly when not aligning the faces. Is this to be expected? Was the Face-CNN trained only on aligned faces detected by the dlib detector?
Thanks,
Ran

Davis King said...

Yes, you have to align the faces. That's very important.

Miles Halter said...

How can I train a classifier on my own dataset?(Faces only)

Davis King said...

Read the blog post. It answers this question.

Terence Liu said...

Hi Davis,

I’ve cropped all the images from Microsoft’s dataset and began training. Yesterday I tried jittered (50 copies) for each image and it took QUITE A WHILE and made the 200 GB dir into a whopping 10 TB one. load_objects_list alone took more than 5 hours. So I instead put jittering (1 copy) into the dnn training code and pushed the jittered image into the images vector and it started training.

dlib compilation was able to recognize cuDNN:


-- Found Intel MKL BLAS/LAPACK library
-- Found CUDA: /usr/local/cuda/8.0.44 (found suitable version "8.0", minimum required is "7.5")
-- Looking for cuDNN install...
-- Building a CUDA test project to see if your compiler is compatible with CUDA...
-- Checking if you have the right version of cuDNN installed.
-- Found cuDNN: /usr/local/cuda/8.0.44/lib64/libcudnn.so

But apparently dnn_metric_learning_on_images_ex is not using the GPU: the output of nvidia-smi shows no running job and no memory consumption.


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 0000:04:00.0 Off | 0 |
| N/A 31C P0 27W / 250W | 2MiB / 16276MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+


I’ve set the number of data_loaders to 1 and monitored the CPU consumption. When I pin the program to one CPU with `taskset -c 0`, that CPU runs at 100%, and when I do not pin it, all available CPUs are used, which would mean the CPUs are being used for training (instead of the GPU). Interestingly, I print a line in each trainer.train_one_step step, and using 1 CPU or all of them gave me about 1 line of output per second either way. Why don't more CPUs (28) lead to a big performance boost? Is it because the mini-batch is too small?

If I could get the NVIDIA P100 (16 GB) to work, what numbers would you recommend for load_mini_batch(5, 5, rnd, objs, images, labels)? Does it depend only on how much the 16 GB can hold?

Terence

Davis King said...

If cmake says it's using cuda then it should be using cuda. You must have something messed up/confused on your system. Maybe you aren't running the program you think you are. I don't know.

Also, do the jittering on load. Don't do it ahead of time since, as you noticed, it creates a huge amount of on-disk content.
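A rough sketch of what jittering on load can look like, written in plain Python rather than dlib's C++ API. The helper names `jitter` and `sample_jittered_batch` are hypothetical, and the jitter here is only a random left-right mirror standing in for whatever augmentation the training code really applies. The point is that each jittered copy exists only in memory for the current mini-batch, so the dataset on disk stays at its original size.

```python
import random

def jitter(image, rnd):
    """Toy jitter (an assumption for illustration): mirror the image
    left-to-right about half of the time. Always returns a new copy."""
    if rnd.random() < 0.5:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

def sample_jittered_batch(images, labels, batch_size, rnd):
    """Sample a mini-batch and jitter each image as it is loaded,
    instead of writing jittered copies to disk beforehand."""
    batch_images, batch_labels = [], []
    for _ in range(batch_size):
        i = rnd.randrange(len(images))
        batch_images.append(jitter(images[i], rnd))
        batch_labels.append(labels[i])
    return batch_images, batch_labels

# Two tiny 2x2 "images" stand in for real face crops.
images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
labels = [0, 1]

rnd = random.Random(0)
batch_images, batch_labels = sample_jittered_batch(images, labels, 4, rnd)
print(len(batch_images), batch_labels)
```

Every call produces a fresh set of jittered images, so over many training steps the network still sees many variations of each face.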

As for other parameters, experiment and see what works best.

Terence Liu said...

Hi Davis,

I got it working. It was a remote submission system where the worker nodes share the same disk space and CUDA libs as the terminal nodes but I compiled it on a terminal node without the actual GPU. Maybe that's the reason? But anyway it works.

Thanks.

Terence Liu said...

Hi Davis,

The training has finished. Just to provide some reference, it took ~5 hrs on a P100 with 4 CPUs to make sure the queue is always full. Increasing the thread count at that N_CPU from 5 (the default in the program) to 20 helped fill the queue, but even with the thread count set to 30 the CPU usage was still not saturated (250% instead of 400%). I suspect there are other bottlenecks, because the filesystem program mmfsd is running like crazy, maybe taking care of the file loading.

I set the steps without apparent progress to 10000 and here are the last few lines of training:

step#: 160776 learning rate: 0.0001 average loss: 0.0694383 steps without apparent progress: 9745
Saved state to face_metric_sync_
done training
num_right: 277
num_wrong: 23

The success rate is 92.3%. Are this and the average loss about the same as in your experiments? To get a higher value, should I decrease the learning rate threshold to a smaller number, or increase the steps without apparent progress?

Thanks,
Terence

Davis King said...

I don't remember what the output was. It doesn't really matter though. You need to evaluate against some benchmark and follow their protocol to see how well you are doing. Only experimenting will determine what works and what doesn't.
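As a rough illustration of what evaluating against a benchmark involves, the sketch below scores a set of labeled face pairs: a pair is predicted to be the same person when the distance between the two descriptors falls below a threshold. This is not the full LFW protocol, only the basic pair-verification idea. The function names, the toy 2-D descriptors, and the threshold of 0.6 are all assumptions made for the example.

```python
import math

def distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_accuracy(pairs, threshold):
    """pairs: list of (descriptor_a, descriptor_b, same_person) tuples.
    Returns the fraction of pairs classified correctly."""
    num_right = 0
    for a, b, same_person in pairs:
        predicted_same = distance(a, b) < threshold
        if predicted_same == same_person:
            num_right += 1
    return num_right / len(pairs)

# Toy 2-D descriptors; real descriptors come from the trained network.
toy_pairs = [
    ([0.0, 0.0], [0.1, 0.0], True),   # close and same person: right
    ([0.0, 0.0], [1.0, 0.0], False),  # far and different people: right
    ([0.0, 0.0], [0.3, 0.4], False),  # close but different people: wrong
    ([0.0, 0.0], [0.0, 0.9], True),   # far but same person: wrong
]

accuracy = pair_accuracy(toy_pairs, threshold=0.6)  # 0.6 is illustrative
print(accuracy)
```

A real evaluation would use the benchmark's own pair lists and splits, which is what makes numbers comparable between models.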

Terence Liu said...

I see. Thanks.

Toni Gubern said...

Amazing work with dlib, Davis.

I've got a simple question... How would you get the distance between two centroids?

I want to compare the centroid of a set of descriptors of known faces against the centroid of a set of descriptors of an unknown face. The centroids are calculated by using the knn algorithm.

Regards.

Davis King said...

A centroid is just the average. So you add them all up and divide by the number of faces.
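A minimal sketch of that averaging, in plain Python, followed by the Euclidean distance between two centroids. The function names are hypothetical and short 3-D vectors stand in for real face descriptors. It also assumes plain averaged vectors rather than dlib's kcentroid object.

```python
import math

def centroid(descriptors):
    """Element-wise mean of a list of equal-length descriptor vectors."""
    if not descriptors:
        raise ValueError("need at least one descriptor")
    n = len(descriptors)
    return [sum(column) / n for column in zip(*descriptors)]

def euclidean_distance(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-D vectors stand in for real face descriptors.
known_faces = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
unknown_faces = [[1.0, 3.0, 4.0]]

dist = euclidean_distance(centroid(known_faces), centroid(unknown_faces))
print(dist)  # 5.0
```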

Toni Gubern said...

would this give the distance between two kcentroids?

double dist=centroid1(centroid2);

Davis King said...

This is documented. http://dlib.net/faq.html#Whereisthedocumentationforobjectfunction

Toni Gubern said...

Excellent! Thank you Davis.

Terence Liu said...

Hi Davis,

I noticed that dnn_metric_learning_on_images_ex.cpp uses input_rgb_image as the input layer, while the Python binding and the LFW test suite use input_rgb_image_sized. This led to serialization problems. Since I have already trained a couple of models, is there a way to convert models with input_rgb_image to those with input_rgb_image_sized?

Terence

Davis King said...

Yeah, that should work. I just pushed the fix to github, so if you get the new code it should do the right thing automatically.

Tsai Joy said...

Hello Davis,
First thanks for the amazing work. Super impressive!

I am using this on real-time video, and when tested on a video sequence, the line
std::vector<matrix<float,0,1>> face_descriptors = net(faces); takes about 700 ticks (0.7 sec),
which is acceptable, but is it possible to make face recognition quicker by turning some knobs?

I still want to use the model pre-trained (dlib_face_recognition_resnet_model_v1.dat),
so to my understanding I can't change anything in level and pooling in the loss-metric net.
By the way I am already using Release mode with AVX instructions.

Thanks,
Joy

Davis King said...

Thanks, glad you like dlib :)

You should compile against an optimized BLAS library like the Intel MKL if you want to run on the CPU, or even better, run on the GPU by installing CUDA and cuDNN. If you install either of these things dlib's cmake scripts should automatically find them and build against them.

Tapas said...

Hello Davis,
Thanks for such wonderful work, which is also helping us in our project.
A small question regarding the Python API:

Can we do something like this?

face_descriptor = facerec.compute_face_descriptor(img)

I don't want to give the shape argument, as the faces are already cropped. I just need the descriptor from the whole image (which is already a cropped face).

Thanks
Tapas

Davis King said...

The python API doesn't support doing that. You can do it via the C++ API though.

It should be emphasized that the network expects a certain kind of cropping and alignment. So if you aren't cropping the faces the way the dlib code does it, then the face recognition accuracy will suffer.

Tapas said...

Hello Davis,

I already have cropped and aligned faces stored in a directory. We are processing video streams and we don't want to run face detection on the same frame again and again. We are experimenting with different classifier parameters to see if accuracy improves or not. I wish the Python API had that feature.

Thanks
