Since the last dlib release, I've been working on adding easy to use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:
Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.
For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.
The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of datasets: the FaceScrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and FaceScrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.
The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib. There is also a Python API for accessing the face recognition model.
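As a rough illustration of what the 0.6 radius means at test time, here is a minimal Python sketch that treats two descriptors as the same person when their Euclidean distance is below 0.6. The vectors are short made-up stand-ins for real 128D descriptors, and `euclidean_distance` and `same_person` are hypothetical helpers, not part of dlib.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(a, b, threshold=0.6):
    """Hypothetical helper: True when the descriptors are closer than the threshold."""
    return euclidean_distance(a, b) < threshold

# Made-up 4D stand-ins for real 128D face descriptors.
face_a = [0.10, 0.20, 0.30, 0.40]
face_b = [0.15, 0.25, 0.30, 0.35]   # close to face_a
face_c = [0.90, -0.40, 0.10, 0.80]  # far from face_a

print(same_person(face_a, face_b))
print(same_person(face_a, face_c))
```

The distance itself is unbounded above, so it is a dissimilarity score rather than a percentage.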
[1] O. M. Parkhi, A. Vedaldi, A. Zisserman. Deep Face Recognition. British Machine Vision Conference, 2015.
[2] H.-W. Ng, S. Winkler. A data-driven approach to cleaning large face datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.
466 comments :
Hello Davis,
I got it working. I simply created a dlib.rectangle object, giving the image information as constructor arguments, and passed it as the second argument to facerec.compute_face_descriptor. It's working.
Thanks
Hi Davis,
Thanks for this cool stuff,
I wonder about the face descriptor computation time.
My machine (Core i7, SSD, GTX 1080 GPU) takes 0.35 sec to extract features for a single face, without any image jittering or augmentation.
Is that normal?
It seems too slow to me for real-time purposes.
That's very slow. You are probably not using CUDA, BLAS, or any other such optimizations. When you compile, CMake will print messages telling you what it's doing, so you can see whether it's using these things.
Hi Davis,
First of all congratulations on your work! Really impressive, and dlib is for me one of the
greatest ML tools around.
I am trying to retrain your model for my type of images (it works very well, but I would still like to train on my own set). I have close to 500K identities; the problem is that I have 1M images (two per subject). Do you think I can still get a good model even without several images per subject?
Thanks in advance,
Miguel
Thanks, I'm glad you like dlib :)
You can try with only 2 per subject, although I'm pretty sure the resulting model isn't going to be very good. The general consensus in the research community seems to be that you need a lot of within-class examples to learn this kind of model. That's been my experience as well.
I understand what you're saying, but I will need to give it a shot since my case is very specific. Another way might be to do some transfer learning. I don't know exactly how to do it, but I will take a look.
Anyway, thank you a lot.
Cheers,
Miguel
Hello Davis, I wonder, is it possible to "straighten" a detected face using dlib? Here's an example:
https://i.stack.imgur.com/4Y9HD.jpg
I only care about landmarks
You can rotate them upright. But there isn't any 3D face warping in dlib if that's what you are asking about.
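Rotating landmarks upright can be sketched in a few lines of Python: measure the angle of the line joining the two eyes and rotate every point by the opposite angle about the eye midpoint. `rotate_upright` is a hypothetical helper, not a dlib function, and the sketch assumes you already know which landmark positions correspond to the eye centers and that `left_eye` is the eye with the smaller x coordinate in the image.

```python
import math

def rotate_upright(points, left_eye, right_eye):
    """Rotate 2D landmarks about the eye midpoint so the eye line is horizontal."""
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    # Angle of the line joining the eyes, measured from the x axis.
    angle = math.atan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0])
    # Rotating by -angle cancels the tilt.
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return rotated

# Made-up landmark positions: two eye centers and one extra point.
left_eye, right_eye = (30.0, 40.0), (70.0, 60.0)
landmarks = [left_eye, right_eye, (50.0, 80.0)]
upright = rotate_upright(landmarks, left_eye, right_eye)
print(upright)
```

This is a pure in-plane rotation, so distances between landmarks are preserved; it does nothing about out-of-plane head pose.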
Instead of http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
can I use my own landmark detector .dat file, which was trained by dlib_shape_detector but detects 90 points instead of 68?
You can do whatever you want so long as the faces are cropped and aligned the same way as the dlib example shows.
It is no doubt an awesome post! Great job Davis King! And thank you!
Thanks :)
Thank you Davis King! It works pretty well!
I have run t-SNE on the FaceScrub database (5 pictures per person) with the dlib face descriptor. In case it is interesting to anyone else, the final picture can be found here: https://drive.google.com/file/d/0B7JrJeplhKLveFgyRDlPNHBUR28/view?usp=sharing
Hello Davis,
I would like to train an object detector with 8 classes (dog, cat, and some other animals), but I want to run it on video input, so I would like it to be as quick as possible. I've already tested train_object_detector.cpp, but it is really slow and decreases the video's frame rate (due to the high resolution). Which is the fastest detector I could use? Is there any particular solution you could propose?
Thanks a lot. I've been working with dlib for months now, and I still have many things to learn... :)
The HOG detector in that example is the fastest one available. Also, http://dlib.net/faq.html#Whyisdlibslow
I noticed in my tests that:
1) A face without a mouth visible got detected as a face
2) When comparing the descriptor of said mouthless face with the descriptor of a face of the same person with a mouth, we still get a very close distance. I.e. it correctly finds them to be the same person!
Am I right in assuming that this is because the mouth data is not used in the descriptor?
(I know that this is not a dlib-specific question and has more to do with the deep learning involved, but I'm not sure where else to ask this. Any tips for other forums?)
Thank you!
No, the whole face crop is used in the computation, including the mouth. The point of this thing is to be robust to all kinds of changes to someone's face that still preserve their identity. So it's good that it works like this.
Hi, first I have to say: great work, Dlib has helped me a lot.
Now for the question, is it possible to identify which image gave which face after the clustering?
Thanks :)
Thanks, I'm glad you like dlib.
Yes, you can find that out. Look at the code. It's trivially available information in the example program.
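One way to keep track of this when faces come from several images is to record, for every extracted face, the image it was cropped from, in the same order as the faces themselves. The clustering step then yields one label per face, and the two lists line up index by index. The sketch below is plain Python with made-up file names and labels; `group_sources_by_cluster` is a hypothetical helper, and the labels stand in for whatever your clustering step returns.

```python
from collections import defaultdict

def group_sources_by_cluster(face_sources, labels):
    """Map each cluster label to the source images of the faces assigned to it."""
    if len(face_sources) != len(labels):
        raise ValueError("need exactly one label per face")
    clusters = defaultdict(list)
    for source, label in zip(face_sources, labels):
        clusters[label].append(source)
    return dict(clusters)

# face_sources[i] is the image that face i was cropped from (made-up names).
face_sources = ["party.jpg", "party.jpg", "beach.jpg", "office.jpg"]
# labels[i] is the cluster id assigned to face i (made-up values).
labels = [0, 1, 0, 1]

for label, sources in sorted(group_sources_by_cluster(face_sources, labels).items()):
    print(label, sources)
```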
Hi dear Davis, thank you very much for your works,
The mmod_human_face_detector is a great model. As you mentioned, the face detector you use for preprocessing in face recognition is the HOG-based frontal face detector.
Can I use the DNN face detector model in face recognition and get the same performance?
You can use any detector so long as you are able to align the faces the same way. To do this with the CNN model you need to use the 5-point face landmarking model released with the newest dlib. When using the 5-point model you can use either the CNN or HOG face detector and they will both give the same performance.
Dear Davis, Thanks for your quick reply
I used CNN model with 5-point face landmarking, but I have an error:
Error detected in file c:\dlib\dlib\image_transforms/interpolation.h.
Error detected in function struct dlib::chip_details __cdecl dlib::get_face_chip_details(const class dlib::full_object_detection &,const unsigned long,const double).
Failing expression was det.num_parts() == 68.
chip_details get_face_chip_details()
You must give a detection with exactly 68 parts in it.
det.num_parts(): 5
The code is here:
auto shape = sp(img, det);
matrix<rgb_pixel> face_chip;
extract_image_chip(img, get_face_chip_details(shape, 150, 0.25), face_chip);
face = move(face_chip);
matrix<float,0,1> face_descriptor = net(face);
Can you help me?
Think about it. You are trying to use a model that wasn't created until dlib 19.7, but you are using an older version of dlib. How can that work?
Hello Davis
I have installed CUDA and then compiled dlib and OpenCV, and everything works. I want to know whether the "dlib.face_recognition_model_v1.compute_face_descriptor" function uses CUDA. If not, do I have to write a wrapper for Python or something like that? I ask because I see no performance difference before and after installing CUDA.
Thank you.
dlib.face_recognition_model_v1.compute_face_descriptor uses CUDA.
But there is no speed difference between before and after compiling with CUDA. Does that mean CUDA makes no difference for this function?
Look at the CMake output when dlib is built. It will tell you if it's using CUDA or not.
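From Python you can also ask the installed module how it was built. The sketch below assumes the dlib Python module exposes a `DLIB_USE_CUDA` attribute, and reads it through `getattr` so it degrades gracefully if the attribute is missing in your build. `describe_dlib_cuda` is a hypothetical helper name.

```python
def describe_dlib_cuda():
    """Return a short, human-readable summary of dlib's CUDA build status."""
    try:
        import dlib
    except ImportError:
        return "dlib is not installed in this Python environment"
    # DLIB_USE_CUDA is assumed to reflect how the installed module was compiled.
    if not getattr(dlib, "DLIB_USE_CUDA", False):
        return "dlib was built without CUDA; the DNN runs on the CPU"
    return "dlib was built with CUDA support"

print(describe_dlib_cuda())
```

If this reports a build without CUDA, installing the CUDA toolkit afterwards changes nothing until dlib itself is rebuilt and reinstalled.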
hi Davis,
I found out that this line of code: std::vector<matrix<float,0,1>> face_descriptors = net(faces); takes the most time. Each face image takes around 300ms to convert into a face descriptor. Any chance to reduce that?
Thanks.
Be sure to link to the Intel MKL if running on the CPU, or even better, use a fast GPU.
Yeah, I compiled with CUDA and it runs faster now. Thanks :)
Hi Duc, how much time does it take now after compiling with CUDA? Mine is 170ms both before and after CUDA, in both cases with Intel MKL. How much is yours now?
This line of code
std::vector<matrix<float,0,1>> face_descriptors = net(faces);
takes 0.44 seconds on 17 faces, so it is around 26ms per face. My GPU is an Nvidia GTX 870M.
Cheers,
Which compiler did you use? Is the platform Windows? And can you tell me how you compiled with CUDA?
I compile on Linux.
The standard way to compile is to use CMake as recommended by Davis. I first run cmake-gui to enable DLIB_USE_CUDA and other options like DLIB_JPEG_SUPPORT, DLIB_PNG_SUPPORT, etc. After that, just run the following commands (refer here: http://dlib.net/compile.html):
cd examples
mkdir build
cmake-gui ( this is when you enable CUDA as mentioned above )
cd build
cmake ..
cmake --build . --config Release
Good luck.
PS: you said it takes 170ms. Is that per face or for the whole vector of faces?
Hi, thanks for the great work.
Can I please know if there is any way to get the name of the person for each clustered group?
Basically, I want to train on, say, 5 sets of people and recognize them in an image.
Hello Davis,
Thanks for such a nice library.
I am facing a problem, continuing from my comment posted on August 24, 2017 at 3:21 AM.
Once I get the face bounding box from a video frame, I extract 128D features from the bounding box. I also save the detected face to disk. When I reload the saved face and extract 128D features again (using a dlib.rectangle covering the whole face image), the features do not match the earlier bounding box features. Why do the features not match?
Thanks
Tapas
I'm wondering if anyone has experimented to determine the minimum face resolution that will result in a reliable computation of the descriptor. For example, will a 40x40-pixel face be comparable to faces of higher resolution?
Hi Davis;
In the Python example, what is the role of shape in the line ... compute_face_descriptor(img, shape)? Does the recognition model calculate the descriptor according to the five landmarks? If so, is it possible to use another shape model, for example one that finds 10 landmarks? In that case, does the distance threshold (0.6) change?
thanks in advance.
The landmarks are only used to align the face before the DNN extracts the face descriptor. How many landmarks you use doesn't really matter.
Hi Davis,
It seems that the size of the output layer of the network model is 128, which corresponds to the 128D vector. How is the 128D vector extracted from an image for training?
This example program shows how to train the model from images: http://dlib.net/dnn_metric_learning_on_images_ex.cpp.html
Hello Davis, How do I determine which image the "image index" refers to. Specifically:
I am training my own detector and received the following:
"RuntimeError: An impossible set of object labels was detected ..."
1. It said that the problem was "image index 1017". How do I find which image this is referring to in the xml file?
2. It also gives the "truth rectangle" and "nearest detection template rect:" with their bounding box params, none of which match any of my bounding boxes. What are these rectangles referring to?
3. Where do I adjust the "match_eps"?
Thank you, Jon
Hello Davis;
Do you plan to provide a Python API for DNN metric learning?
Thanks.
Like this? https://github.com/davisking/dlib/blob/master/python_examples/face_recognition.py
No, I mean python equivalent for http://dlib.net/dnn_metric_learning_ex.cpp.html
No, I'm not going to add that since it's impossible to define the network architecture from python.
Hi Davis,
Have you ever tried it on a Jetson TX2? I wonder how fast it would be.
Is there any chance to optimize it on an ARM Cortex-A7 (dual core) to reach approx. 3-5fps?
Or would you rather say forget it?
Thanks!
I haven't used a Jetson, but that's a very popular way to run these DNNs. Most people find the performance to be quite reasonable.
Mike, I had posted a question about this in the TX2 forum after I did a bit of testing:
https://devtalk.nvidia.com/default/topic/1025670/jetson-tx2/dnn-face-detection-performance-on-tx2/post/5217046/
At least for my particular test (dlib DNN-based face detection on high-res images), it appears that the Jetson TX2 is ~10x slower than a GTX 1070. Please do let us know what you find. :)
Bill, thanks for your feedback. I have measured the execution time for the dnn_face_recognition_ex on a Jetson TX2.
It is a Release compile (though with debug info) using CUDA.
The time is exclusive of loading the data and displaying the images, just the inner execution: 4204 ms.
Is that in line with your findings?
I will get rid of the debug info and play with compiler settings.
Hey Mike,
Firstly, I haven't executed the dnn_face_recognition_ex example specifically. Sorry. Secondly, be sure to do a few iterations of whatever you are trying since the first few (?) may be much slower than the others, due to things being initialized the first time around...
Hi Bill,
I do run the example several times but it does not get any better than 4.2 seconds. It detects 24 faces in total of 4 different guys, so my take is 5-6 fps. Any hints on compiler options to check?
Next, I will compare this result with a standard dual core ARMv7@1GHz, eventually using NEON and VFP support...
Hi Bill,
the same code on an ARM Cortex-A7 @1GHz takes 130 seconds. It is a release compile with -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -ffast-math -Ofast ...
Sounds like a decent speed-up. :)
Hello Davis,
Am I right in assuming that dnn_face_recognition_ex as of v19.7 uses the HOG-based frontal face detector and NOT the CNN-based one? So I should be able to expect performance improvements for the landmark extraction by employing NEON?
Yes, that example uses HOG. You could just as easily use the CNN face detector with it though.
Dear Davis:
Recently I've been wanting to add the OpenMP "#pragma omp parallel for"
in dlib\matrix\matrix_default_mul.h,
specifically like this:
for (long r = lhs_block.top(); r <= lhs_block.bottom(); ++r)
{
    for (long c = lhs_block.left(); c <= lhs_block.right(); ++c)
    {
        const typename EXP2::type temp = lhs(r,c);
        #pragma omp parallel for //<----------------inserted OpenMP line here
        for (long i = rhs_block.left(); i <= rhs_block.right(); ++i)
        {
            dest(r,i) += rhs(c,i)*temp;
        }
    }
}
However, when I run examples, e.g. dnn_face_recognition_ex.cpp,
I don't see multi-core processing (via Intel's VTune tool) when the line that computes the 128D descriptors runs:
std::vector<matrix<float,0,1>> face_descriptors = net(faces);
Where should I put the "#pragma omp parallel for" to enable OpenMP multi-core processing in dlib?
Linking dlib with the Intel MKL is a much better approach to get the kind of speed boost you are looking for. The CMake scripts are already set up to do it if you install the MKL.
Hi Davis,
are the dlib-for-ARM improvements by fastfastball now part of dlib-19.7, e.g. SIMD for NEON and threading?
Or would I have to redo the changes for 19.7?
All that stuff is now part of the main dlib codebase, so yes, it's there. You don't need to do anything to get it.
Hi Davis,
Thank you for your great work! It really helps me a lot in my project.
Regarding the accuracy rate of 99.38% on LFW, do you only run the test on the 1680 people pictured in more than one photo? The LFW site says there are some incorrectly labeled photos. How do you handle these photos? Do you manually correct them or ignore them in your test?
How do you do the recognition testing? Do you calculate the 128D feature for every photo, compare them with each other, and check whether the distance between photos of the same person is less than 0.6?
Thank you in advance!
Kevin
I follow the exact evaluation protocol laid out by the LFW challenge. This file contains the entire test script for the dlib model: http://dlib.net/files/dlib_face_recognition_resnet_model_v1_lfw_test_scripts.tar.bz2. You can run it and see the LFW evaluation outputs.
Hi Davis,
Thank you for your reply!
I have a question about training face model. Could you give me some comments?
If I increase the number of face images from 3 million to 6 million, will the trained model verify a person's face better? For example, will the accuracy rate increase and the false positive rate decrease?
In your experience, is there an upper bound on recognition capability, i.e. a point beyond which recognition capability no longer increases even though the number of training face images does?
Best regards,
Kevin
Yes, more data is better. For instance, Google trained a face recognizer on 200 million faces and got great results.
Hello Davis,
Could you please explain how you form a mini-batch?
Do you take some number of different persons and some number of their unique images? For example, you choose 64 persons and take 8 images for each, so the size of the mini-batch will be 512. Or do you just take some random images, so that for one person you have 10 images, for a second 3, and so on?
Yury Savitskiy
You can make the mini-batches any way you want. To see what I did, refer to the metric learning example program.
Hi thank you davis for your well documented great work
I have encountered a problem. I built dlib with mingw32 and I am using it in Qt. Everything works: when I use a dlib function it does its job with no problem. But after I close the application, its process does not exit; there is still a process named after my application.
I couldn't find anything relevant to the problem.
I might add that I tested it on Windows 7 and Windows 10, and both behave the same.
Do you have any idea what is going on?
How can I pre-filter images during the recognition phase for liveness detection?
This is to avoid getting a positive recognition from photos shown to the webcam. We only want live people in front of the camera.
Dario
There aren't any functions in dlib that do this, so you will have to roll your own.
Hi, congrats for dlib, it rocks !
Can you be more precise about the DNN used for face comparison and what it does exactly? Thanks
The entire network is defined in the example program linked to from this blog post. If you read the example program you will find all the details.
Hi again
I wanted to run the bald guys face recognition example, but on this line:
faces.push_back(dlib::move(face_chip));
I get an error that says 'move is not member of dlib'.
I should add that the example works without the 'move' function. I was just wondering what the 'move' function is doing. Will not using it reduce the accuracy of face recognition?
Thank you for your great work.
Right, there is no dlib::move(). Happily, that's not what is in the example code. You must have put that there yourself. Get the unmodified example and it will work.
Hi
regarding 'move' function, u mentioned maybe i added that myself
No i didn't put anything new to the example, it was like this:
http://dlib.net/dnn_face_recognition_ex.cpp.html
This is what is in the example:
faces.push_back(move(face_chip));
There is no dlib::move in there.
oooh sorry,
I was mixed up between namespaces................
Hi Davis,
Very new to deep learning, but I'm used to seeing trained models with a .params and .json file. I see yours is a .dat. I'm trying to get this to work on the AWS deeplens and having some trouble. Is there a way to turn the model into a .params and .json format?
That's not how this works. If you want to use dlib's models use dlib.
Hi Davis,
Just an open question: I saw that dlib has a repeat functionality that allows using much less memory during compilation (not sure about during execution). Is it possible to convert a model that was defined without this repeat layer into one that uses it? Specifically, is it possible to convert this face model to use dlib::repeat?
Thanks,
Miguel
repeat is just a convenience. It doesn't make things faster or slower. And it only makes a substantive impact on compile times when using visual studio since visual studio has not so great template compilation in general. For gcc or clang it doesn't matter.
No, there is no conversion.
Hello Davis,
You stated that the trained model has a 99.38% accuracy on the standard LFW face recognition benchmark. Is there a way to translate that into FAR/FRR values?
Thanks!
Michael
The FAR and FRR rates you get are going to be heavily dependent on your application and how you use it. So no, there is no general FAR or FRR value. For example, the larger the database of faces you are searching the more likely you are to get a false positive.
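To see why database size matters so much, here is a small worked example in Python. It assumes, purely for illustration, that every comparison against a database entry is an independent trial with the same per-comparison false accept rate p; under that simplification, the chance of at least one false match among N entries is 1 - (1 - p)^N. The rate used below is a made-up number, not a measured property of the dlib model.

```python
def prob_any_false_match(per_comparison_far, database_size):
    """Chance of at least one false accept among independent comparisons."""
    if not 0.0 <= per_comparison_far <= 1.0:
        raise ValueError("per_comparison_far must be a probability")
    return 1.0 - (1.0 - per_comparison_far) ** database_size

# Hypothetical per-comparison false accept rate of 0.1%.
far = 0.001
for size in (1, 100, 1000, 10000):
    print(size, round(prob_any_false_match(far, size), 4))
```

Real comparisons are not independent, so treat this as an intuition aid rather than a prediction.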
Hello Davis!
I'm trying to reproduce your results on the LFW dataset. I saw the code you provided, ran it, and got the same result. But when I looked into the code, I saw that your function runs on get_lfw_pairs(), which returns the pairs of images with rectangles for indexing. Then it chooses the best_det according to overlap.
What are those rectangles that come from get_lfw_pairs()? (The other ones I get are from the detector...)
Thanks!
If the detector didn't find a face then the box is just the box in the center of the image, which is where the face is nominally supposed to be.
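The selection logic described here can be sketched as follows: score each detector box by its overlap with the nominal box and fall back to the nominal box when nothing overlaps it. This is an illustration in plain Python, not the code from the LFW test scripts. Boxes are (left, top, right, bottom) tuples with exclusive right and bottom edges, intersection over union is used as the overlap measure by assumption, and the nominal box is passed in by the caller because its size is not stated in this thread.

```python
def intersection_over_union(a, b):
    """Overlap of two (left, top, right, bottom) boxes as intersection / union."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def pick_best_detection(detections, nominal_box):
    """Return the detection overlapping the nominal box most, else the nominal box."""
    best, best_overlap = nominal_box, 0.0
    for det in detections:
        overlap = intersection_over_union(det, nominal_box)
        if overlap > best_overlap:
            best, best_overlap = det, overlap
    return best

nominal = (50, 50, 150, 150)                       # made-up center box
detections = [(0, 0, 40, 40), (60, 60, 160, 160)]  # made-up detector output
print(pick_best_detection(detections, nominal))
print(pick_best_detection([], nominal))
```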
I've been playing with dnn_metric_learning_on_images_ex (having read all the docs / comments) but there are still a few things that I'm not sure about:
- load_mini_batch ensures that the batch doesn't re-include the same person twice. However, when choosing the samples for a given person, it doesn't try to avoid including the same sample twice. Is this by design / ok? Will we run into problems if some of the persons in the training set have fewer than samples_per_id samples available?
- I read that for dlib_face_recognition_resnet_model_v1 you used a mini-batch size of 35x15 instead of 5x5. Is that just for performance reasons or would the results have been significantly different?
Thanks!
The images are jittered, so even if the same image is included it's fine. You could experiment to see if it would be better to avoid duplicates, but I doubt it matters, at least for most datasets.
Yes, the batch size is very significant. Some sizes lead to much higher accuracy models.
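The sampling behavior discussed above (each person appears at most once per batch, while a person's samples are drawn with replacement) can be sketched in Python like this. It mirrors the description in this thread rather than reproducing the C++ example line by line; `images_by_person` is a hypothetical dict from identity to image paths, and jittering is left out.

```python
import random

def load_mini_batch(images_by_person, num_people, samples_per_id, rng):
    """Pick distinct people, then draw samples_per_id images per person with replacement."""
    if num_people > len(images_by_person):
        raise ValueError("not enough identities for the requested batch")
    # Distinct identities: no person is included twice in one batch.
    person_ids = rng.sample(sorted(images_by_person), num_people)
    images, labels = [], []
    for person_id in person_ids:
        for _ in range(samples_per_id):
            # Drawn with replacement, so a person with fewer than
            # samples_per_id images simply contributes repeats.
            images.append(rng.choice(images_by_person[person_id]))
            labels.append(person_id)
    return images, labels

dataset = {
    "alice": ["alice_1.jpg", "alice_2.jpg", "alice_3.jpg"],
    "bob": ["bob_1.jpg"],  # fewer images than samples_per_id
    "carol": ["carol_1.jpg", "carol_2.jpg"],
}
batch_images, batch_labels = load_mini_batch(dataset, 2, 3, random.Random(0))
print(list(zip(batch_labels, batch_images)))
```

Because repeats are jittered before training in the real pipeline, duplicated picks do not produce identical inputs.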
Hi
What if I want to classify cars with the metric loss, using the dnn_metric_learning_on_images example?
Do you think I will achieve acceptable accuracy?
It might work great, the only way to know is to try and see what happens though.
Thanks for this great package!
I am using the face recognition API. I want to know how it can need only one image for recognition.
Or does it just measure the difference between two images?
Hi
I wonder how the length function calculates the difference between two face vectors?
And can I derive an accuracy percentage from that number, which is less than 1,
for example 0.5 meaning 90 percent?
How can I compute a matching metric between 2 faces?
Read the blog post and the linked example program, it's literally about answering that question.
Yeah, I saw that. I would like to make a matching metric. For example, I have a standard image named img0 and several others: img1, img2, img3. I would like to have a percentage of match between img0 and {img1, img2, img3}. I can't do this with graph clustering.
I made this, but I don't know if it's right:
perc_matching1 = 1-[length(img0-img1)/length(img0)]
perc_matching2 = 1-[length(img0-img2)/length(img0)]
perc_matching3 = 1-[length(img0-img3)/length(img0)]
The neural network is trained on 7485 persons, but it can recognize a person who is not in the database.
Will the tested person be close to the trained person with whom they share the most physical similarities?
Hi Davis,
I have a question about the 3 million image dataset. When you said the network was trained on a dataset of 3 million faces, does it mean that there are 3 million distinct faces, or is the number of distinct faces less than 3 million? If so, how many distinct faces are there among the 3 million?
Thank you.
Marc
3 million images. The number of identities is in this blog post.
Hi Davis,
Needless to say, your face recognition network works great! However, I am curious to know how you selected the distance threshold (0.6). Did that parameter affect the training rate, time, or accuracy? Also, what is the bound on the norm of the 128-d embedding that you get from an image? FaceNet (Google team) restricts the norm of the 128-d embedding to be unity, but your network does not do that. I am wondering if you implicitly bound the norm of the 128-d embedding. I look forward to your reply.
I picked 0.6 empirically. It doesn't have any grand significance. 0.6 gave the best results.
I didn't place any bound on the norm. The loss function is constructed in a way that doesn't depend on there being any particular bound. Although, the relative setting of the threshold (the 0.6) and the amount of weight decay implicitly determine the scale of the norm. Look at the documentation for the loss layer for the details.
I couldn't find the documentation for the loss layer on your website. I typed loss layer / metric in the quick search bar of the documentation page, but didn't find anything related to that. The only thing I can see is the C++ source code, which is hard to parse through to understand the math behind your loss metric. Please correct me if I am wrong, but I think you are using the hinge loss with a threshold of 0.6 as described in the paper (http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf)
If this is how your loss metric function is, then did you consider all pairwise image combinations in your dataset, unlike the triplet loss combination chosen by FaceNet paper by Google team? The data size becomes huge when you have to consider all pairwise image combinations.
http://dlib.net/ml.html#loss_metric_
Click on more details and read the big comment about the loss layer.
Hi Davis,
Thanks a lot! I went over the details and it looks like your loss metric is very similar to the Hadsell 2006 paper in that you also force the embeddings of similar images to lie within the distance threshold in addition to forcing the embeddings of dissimilar images to lie outside the threshold. The one last thing is the hard negative image mining of the non-match pairs. How do you decide the N worst non-match pairs? What is the metric here? Do you consider those N pairs among all the non-match pairs whose mutual Euclidean distance is farthest from 0.6?
You sort them by distance and take the pairs that are the most wrong.
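A simplified Python sketch of the idea discussed in this exchange: matching pairs are penalized for being farther apart than the threshold minus a margin, non-matching pairs for being closer than the threshold plus a margin, and only the hardest non-matching pairs (the ones with the smallest distance, i.e. the most wrong) are kept. The margin value, the `num_hard_negatives` parameter, and the missing normalization are assumptions made for illustration; the authoritative definition is the loss_metric_ documentation. `math.dist` needs Python 3.8 or newer.

```python
import math
from itertools import combinations

def pairwise_hinge_loss(embeddings, labels, num_hard_negatives,
                        threshold=0.6, margin=0.04):
    """Sum of hinge losses over all match pairs and the hardest non-match pairs."""
    match_losses = []
    non_match = []  # (distance, loss) for every non-matching pair
    for i, j in combinations(range(len(embeddings)), 2):
        d = math.dist(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            # Same identity: should be closer than threshold - margin.
            match_losses.append(max(0.0, d - (threshold - margin)))
        else:
            # Different identity: should be farther than threshold + margin.
            non_match.append((d, max(0.0, (threshold + margin) - d)))
    # Smallest distance first: these non-match pairs are the most wrong.
    non_match.sort(key=lambda pair: pair[0])
    hard_losses = [loss for _, loss in non_match[:num_hard_negatives]]
    return sum(match_losses) + sum(hard_losses)

# Made-up 2D embeddings for two identities (0 and 1).
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (2.0, 0.0)]
labels = [0, 0, 1, 1]
print(pairwise_hinge_loss(embeddings, labels, num_hard_negatives=1))
```

Restricting the non-match term to the hardest pairs keeps the far larger number of easy non-match pairs from dominating each mini-batch.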
I have a simple question. Pardon my ignorance.
The face recognition example uses 2 .dat files: 1 for face landmarks and 1 for the DNN.
If I were to replace the DNN model with one that has been trained on 10 million faces, would I also need to rebuild the face landmark file? If so, which example file should I refer to?
Thank you.
HI
I was wondering which of the 5-point and 68-point face landmark models gives better accuracy in face recognition.
I saw you used the 5-point model in the face classification example and the 68-point model in the code for testing overall accuracy.
The documentation has links to all relevant examples for each object. E.g. http://dlib.net/ml.html#shape_predictor_trainer
Also, there is no reason why retraining a face recognition model would invalidate a shape predictor model.
The 5 and 68 point models should give the same face recognition accuracy. I recommend using the 5 point model because it's smaller and faster.
A question about the training database.
What is your opinion on noise in the training database? I realized that large publicly available databases often contain noise (e.g. out of 100 images of ID_person1, some are not ID_person1).
Should I spend time manually removing that noise? If not, what percentage of noise can be considered tolerable?
Do your 3M faces also contain some noise? Do we know the percentage of noise?
Thank you.
Hi Davis.
According to their paper, FaceNet training takes 1000-2000 hours. What about yours?
Can you compare your model with the FaceNet?
The quality of the training data is very important. I spent a lot of time fixing errors in my dataset and each round of fixing errors notably improved the resulting model. You should spend time doing this, retraining to see how much the improvement is, and repeating that until you get tired of doing it or until the results stop improving. My resulting dataset is quite accurate.
Training the model in dlib takes about a day on a 1080ti.
How/When to decide a new training is required? Any hint? If we use a classifier (such as one-class SVM or multiclass SVM) on top of your system (in 128x1 space), does it increase the system performance?
Do experiments to find out. If all else fails then you need to retrain.
Hello Davis,
On an ARM-based embedded system we have neither CUDA nor AVX nor OpenCL. Would you encourage looking into employing the NEON and VFP units to achieve a worthwhile speed-up? Or is the structure of the computations unsuitable for either one?
Thanks!
NEON is very good, I would use it. Although the DNN code in dlib doesn't do anything with it currently. You can turn on gcc options that automatically use it and use profile driven optimization and maybe that will get you part of the way there.
Hello Davis,
Thanks for wonderful model and library.
With the recent dlib 19.9, I used face_clustering.py to clean up a big dataset. That is, I kept only the faces of the biggest cluster. Now when I try to run dnn_metric_learning_on_images_ex, I am getting this error:
EXCEPTION IN LOADING DATA
jpeg_loader: error while reading /tmp/cluster_sample_5/n005387/face_162.jpg
Please help.
Thanks
hi
I wanted to convert a matrix<rgb_pixel> to an OpenCV Mat with the toMat function, but I get this error:
type is not a member of cv::DataType<dlib::rgb_pixel>
this error comes from to_open_cv.h
what am i doing wrong?
The distance can be any number >= 0.
Copy the mmod_rects into rectangles.
I have noticed that descriptors of b&w photos of people seem to skew towards being close to descriptors of other b&w photos. I assume that this is due to (at least partially) representation bias in the training set. This makes me wonder what would happen if everything were retrained on the same training set, but with all training data converted to grayscale beforehand. Would we lose the bias without losing any recognition accuracy? Has anyone tried it?
Yes, it's not going to work as well with black and white pictures since it's trained on color pictures. It's likely that it would be somewhat better for black and white images if trained specifically on black and white images. I haven't done this though.
Measuring performance on the Jetson TX2, I found most of the processing time is still spent on face detection. I already took some shortcuts, like limiting the pyramid levels to 2 (object_detector) and spatially subsampling the input image by 2 in each direction.
Here are the results in ms:
face detection = 66
5 landmarks = 6
face chip extraction = 2
dnn vector calc = 10
Are you aware of the paper: Compact Convolutional Neural Network Cascade for Face Detection
(https://arxiv.org/ftp/arxiv/papers/1508/1508.01292.pdf)
The authors claim to have found a "new level of performance/speed ratio for the frontal face detection problem".
This would be an excellent candidate for dlib...
I haven't seen that paper, but that kind of cascade is very common and certainly generally improves detector speeds. At some point I'll add a faster version of the CNN face detector to dlib that uses a cascade, but it wouldn't be until the end of the year at the earliest.
Hi Davis,
Your model was trained on color images. When testing/comparing faces, I plan to use a color image and the corresponding grayscale one. In other words, for a given face, two (128x1) feature vectors will be constructed. The smaller of the resulting distances between two faces will be accepted as the real distance between them. What do you think about this scheme?
Hi Davis King
I'm trying to find papers that explain what the target output for training is (how to set the 128D target values for training).
Can you point me to some papers that cover this?
Hi
I encountered an error in the face recognition process. It happens when two faces partially cover each other and I push_back both of them into a vector of matrix, then the net and ....
But when I process those two faces separately there is no problem.
I wonder what the problem could be?
Hi Davis King,
How can I change the distance_threshold and margin for loss_metric_? The easiest way is to change them in loss.h. Is there a better way to do this?
Thanks
You can set them by calling the loss_metric_'s constructor.
Thanks very much for the prompt reply. I am calling loss_metric_'s constructor in compute_loss_value_and_gradient, but neither the margin nor the threshold changes. Assigning new values directly to margin and dist_thresh in the same function (compute_loss_value_and_gradient) gives a read-only error for these two variables. Apologies if I am doing something wrong, but any hint would be highly appreciated.
Thanks
Hey there. You've mentioned previously how to speed up the face detector in C++. Are there any ways to speed it up in Python?
Hi Davis, I have been working on face detection and then recognition with your dlib. Detection (I use a different method) can sometimes yield false positives. Is there any method to filter them out using the deep metric vector (128 numbers) at the recognition stage?
Hi Davis,
Is there a reason why, for the Chinese whispers algorithm, we are not using the weights between the nodes as the edge weights? While building the graph, no 'distance' argument is passed (in edges.push_back(sample_pair(i,j))), so the default edge weight of one is used for all edges (in both tools/python and the C++ example). Shouldn't we pass the edge weight when building the graph?
Did you ever compare performance on the clean vs unclean dataset? How much did cleaning improve final performance or reduce training time?
> I have tried to use dlib to detect anime faces but it only works less than 50% of the time.
Richard: Use Nagadomi's face detector. I've used it on thousands of images from my Danbooru2017 collection, and it has good accuracy; it only occasionally selects non-faces, and errors tend to be more like poor cropping.
Hey Davis, thanks for the awesome work.
Can I know the exact architecture of your 29 layer ResNet? Is there any chance of getting the source code of your ResNet face recognition model?
Cleaning the dataset made a huge difference. It is very worth it to make sure your dataset has correct labels.
But how much difference?
I don't remember how much exactly. But anyone trying to train such a model without spending time cleaning their dataset is wasting their time. The single most important detail of making this kind of thing work is having a good dataset.
Hello Davis,
First of all, thank you very much for releasing such a State-of-the-Art Net (and more generally for all of your work).
I noticed that when the face descriptors are computed on LFW images downscaled by a factor of 2, the TPR may drop by less than 0.5 percent (@FPR ~0.01). The native face size is then around 64x64 before the face chip extraction at 250x250.
A pre-deep-learning way of thinking would then be: "let's just work at a smaller scale and use smaller filters".
I wonder if it would be worth trying to train a ResNet with a 128x128 or even 64x64 input layer?
Or would the dramatic loss of degrees of freedom in such a ResNet prevent it from learning a good metric?
Did you already try to train such a smaller net?
Thanks again.
Reducing the input resolution is probably going to negatively impact the accuracy. I tried a lot of possible architectures, and the one I posted works the best. I'm sure there is room for improvement though.
hi Davis
I noticed that dnn_metric_learning_on_images_ex.cpp uses input_rgb_image as the input layer, while the Python binding and the LFW test suite use input_rgb_image_sized. This leads to serialization problems. Since I already have a couple of trained models, is there a way to convert models with input_rgb_image to ones with input_rgb_image_sized?
Thanks!
You can convert between networks by just saying:
net1 = net2;
That works whenever every layer in net1 is constructable from net2, which is the case here.
Thanks, Davis
Sorry, I'm still a little confused. Do you mean to replace the network of dnn_face_recognition_ex.cpp with the network of dnn_metric_learning_on_images_ex.cpp? Or is it modified in dnn_face_recognition_ex.cpp? Maybe my understanding is not good, I hope you can tell me more details.
thank you
Hello Davis,
I'm using the code in dnn_metric_learning_on_images_ex.cpp to train a face recognition model using about 3M images. Although I set set_iterations_without_progress_threshold(10000) as you recommend, it still terminates quite soon, after 20000 steps. How many steps did it take you to train the dlib_face_recognition_resnet_model_v1.dat model?
I don't recall how many steps, but training took about a day on a 1080ti. So a lot more than 20,000 steps.
Thanks, Davis. I have another question.
The learning rate is decreased during the training procedure, right? But how? Do you use a learning rate schedule such as exponential decay, or do you drop its value when the loss increases? I'm not familiar with C++, so it's hard for me to find the answer in your code.
It's just waiting for the loss to stop decreasing. Then it reduces the learning rate by some user defined multiple. This is all explained in the documentation at length. You don't have to read the code.
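For readers who do not want to dig through C++, the idea described in the reply above can be sketched in a few lines. This is a deliberately simplified stand-in, not dlib's implementation: the class name PlateauSchedule and its parameters are hypothetical, and it declares "no progress" whenever the loss fails to beat its best value so far, whereas dlib's own criterion is documented with the trainer.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical, simplified plateau schedule (an assumption for illustration,
// not dlib's actual progress test).
class PlateauSchedule {
public:
    PlateauSchedule(double initial_lr, double shrink_factor,
                    std::size_t patience, double min_lr)
        : lr_(initial_lr), shrink_(shrink_factor),
          patience_(patience), min_lr_(min_lr)
    {
        assert(shrink_factor > 0.0 && shrink_factor < 1.0);
        assert(patience > 0);
    }

    // Feed one loss value per training step.
    // Returns false once the learning rate has fallen below min_lr,
    // which is the signal to stop training.
    bool step(double loss) {
        if (loss < best_loss_) {
            best_loss_ = loss;        // progress: remember the new best loss
            stale_steps_ = 0;
        } else if (++stale_steps_ >= patience_) {
            lr_ *= shrink_;           // plateau: shrink the learning rate
            stale_steps_ = 0;
        }
        return lr_ >= min_lr_;
    }

    double learning_rate() const { return lr_; }

private:
    double lr_;
    double shrink_;
    std::size_t patience_;
    double min_lr_;
    double best_loss_ = std::numeric_limits<double>::infinity();
    std::size_t stale_steps_ = 0;
};
```

In this sketch, patience plays the role of the "iterations without progress" threshold discussed in the thread: a larger value makes the schedule wait longer before each reduction, so training runs for more steps before the learning rate reaches its floor.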
Thank you very much for your support. (y)
Hi Davis,
Is it possible to extract features from intermediate layers of the face recognition model? That is, rather than getting the 128D vector, get higher dimensional data for some other face related analysis. Thanks
Hi Davis!
On my computer with 6 processors, I run up to 4 threads and everything works fine under one condition:
the face search zone is fixed.
As soon as I complicate the algorithm by changing the search zone depending on the location of the face, an exception occurs in the call
faces = (* dlib :: frontal_face_detector (roiImg));
If, in this more sophisticated version, I wait for each thread to finish right after launching it - (t[i] = thread(f,...); t[i].join;) - the program works without failures.
From my examples, it appears that the frontal_face_detector is not thread-safe.
Is that so?
Thank you Nikolay.
In general, you can't operate on a single instance of an object with multiple threads without performing some kind of thread synchronization. That applies here as well.
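The two usual ways of following that advice can be sketched with the standard library alone. The Detector type below is a hypothetical stand-in for any object that mutates internal state when called; it is not dlib's detector, and the function names are made up for illustration. Option 1 gives each thread its own copy, option 2 shares one instance behind a mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a detector with internal scratch state.
// Calling one instance from several threads at once would be a data race.
struct Detector {
    std::vector<int> scratch;
    std::size_t operator()(const std::vector<int>& image) {
        scratch = image;          // mutates internal state
        return scratch.size();    // pretend this is the detection count
    }
};

// Option 1: every thread works on its own private copy.
std::size_t run_with_private_copies(const Detector& prototype,
                                    const std::vector<std::vector<int>>& images) {
    std::vector<std::size_t> results(images.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < images.size(); ++i) {
        workers.emplace_back([&, i] {
            Detector local = prototype;   // private copy, nothing shared
            results[i] = local(images[i]);
        });
    }
    for (auto& w : workers) w.join();
    std::size_t total = 0;
    for (std::size_t r : results) total += r;
    return total;
}

// Option 2: one shared instance, with calls serialized by a mutex.
std::size_t run_with_mutex(Detector& shared,
                           const std::vector<std::vector<int>>& images) {
    std::mutex guard;
    std::vector<std::size_t> results(images.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < images.size(); ++i) {
        workers.emplace_back([&, i] {
            std::lock_guard<std::mutex> lock(guard);  // one caller at a time
            results[i] = shared(images[i]);
        });
    }
    for (auto& w : workers) w.join();
    std::size_t total = 0;
    for (std::size_t r : results) total += r;
    assert(results.size() == images.size());
    return total;
}
```

Private copies cost memory but let the threads run in parallel. The mutex version is simpler to retrofit, but it serializes the expensive call, so it gives little speed-up when detection dominates the run time.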
Hi,
The face recognition model is poor with child faces. Is it because your training pictures contained only adults? Can it be improved by using pictures of children?
Thanks :)
Yes, the training dataset is very heavily biased towards adults. If you trained on a large database of children I'm sure the model would be much better.
Dear Davis:
Thanks again for this amazing library!
I ran training on a 2M faces database via dnn_metric_learning_on_images_ex.cpp with the following result:
step#: 110326 learning rate: 0.0001 average loss: 0.0145144 steps without apparent progress: 9926
Saved state to face_metric_sync_
done training
num_right: 300
num_wrong: 0
However when I used it in dnn_face_recognition_ex.cpp it gets this error:
An error occurred while trying to read the first object from the file dlib_face_recognition_resnet_model_v1.dat.
ERROR: Unexpected version found while deserializing dlib::input_rgb_image_sized.
I used the dlib HOG face detector to extract face crops of size 150 with padding 0.25.
Below are the changes I made to the training code, specifically in the load_mini_batch function (marked with //added by me 181103).
What could I have done wrong?
void load_mini_batch (
    const size_t num_people, // how many different people to include
    const size_t samples_per_id, // how many images per person to select.
    dlib::rand& rnd,
    const std::vector<std::vector<string>>& objs, // output of load_objects_list()
    std::vector<matrix<rgb_pixel>>& images,
    std::vector<unsigned long>& labels,
    frontal_face_detector &detector, //added by me 181103
    shape_predictor &sp //added by me 181103
)
{
    images.clear();
    labels.clear();
    DLIB_CASSERT(num_people <= objs.size(), "The dataset doesn't have that many people in it.");
    std::vector<bool> already_selected(objs.size(), false);
    matrix<rgb_pixel> image;
    for (size_t i = 0; i < num_people; ++i)
    {
        size_t id = rnd.get_random_32bit_number()%objs.size();
        // don't pick a person we already added to the mini-batch
        while(already_selected[id])
            id = rnd.get_random_32bit_number()%objs.size();
        already_selected[id] = true;
        for (size_t j = 0; j < samples_per_id; ++j)
        {
            const auto& obj = objs[id][rnd.get_random_32bit_number()%objs[id].size()];
            load_image(image, obj);
            /////////////added by me 181103/////////
            matrix<rgb_pixel> face_chip;
            std::vector<rectangle> det = detector(image);
            if (det.size() == 0)
                continue;
            dlib::full_object_detection shape;
            shape = sp(image, det[0]);
            extract_image_chip(image, get_face_chip_details(shape, 150, 0.25), face_chip);
            images.push_back(std::move(face_chip));
            /////////////added by me 181103/////////
            labels.push_back(id);
        }
    }
    // You might want to do some data augmentation at this point. Here we do some simple
    // color augmentation.
    for (auto&& crop : images)
    {
        disturb_colors(crop,rnd);
        // Jitter most crops
        if (rnd.get_random_double() > 0.1)
            crop = jitter_image(crop,rnd);
    }
    // All the images going into a mini-batch have to be the same size. And really, all
    // the images in your entire training dataset should be the same size for what we are
    // doing to make the most sense.
    DLIB_CASSERT(images.size() > 0);
    for (auto&& img : images)
    {
        DLIB_CASSERT(img.nr() == images[0].nr() && img.nc() == images[0].nc(),
            "All the images in a single mini-batch must be the same size.");
    }
}
Hello Davis,
remember I mentioned this paper "Compact Convolutional Neural Network Cascade for Face Detection"
(https://arxiv.org/ftp/arxiv/papers/1508/1508.01292.pdf) to you. Face detection is still the main bottleneck in the whole processing chain of realtime facial recognition:
face detection = 66ms
5 landmarks = 6ms
face chip extraction = 2ms
dnn vector calc = 10ms
Is there any chance we would see the above algorithm in dlib?
Look at the code you are running. You get that error because those two examples use different network definitions. Look at the definitions and look at the error message, it should be very clear what's happening. You have to use compatible network definitions.
As for the other poster's question, I'm not going to make a new face detector any time soon as I'm busy with other projects.
Thanks Davis for the quick reply!
I changed dnn_face_recognition_ex.cpp
input_rgb_image_sized<150>
into
input_rgb_image
and it worked!
The results look promising.
I will run it against LFW and YTF and see the scores ;)
Hi Tsai Joy! Did you train the face recognition ResNet from scratch or just fine-tune it? Is fine-tuning even possible? I'm interested in your results.
Hello Andrey Zakharoff:
Yes, I trained it from scratch with dlib on a 1080 ti via CUDA/cuDNN, with set_iterations_without_progress_threshold(10000) and faces cropped with 0.25 padding using the dlib HOG MMOD face detector. It took about 12 hrs, but the results seemed worse than the original model trained by Davis. Training is still ongoing with more images (2M -> 6M). I'll update my results after the training is done.
Hello Tsai, thanks for the reply.
Did you use the same dataset that Davis used (or partially the same)? I wonder whether it is possible to fine-tune with an additional small dataset, because Davis used a rather big dataset, but it was mostly celebrities, which means high resolution images. In practice we have small heads in a large background picture, so after detection we get a low-resolution face.
Also, it would be interesting to know whether recognition would work with profile faces if specially trained.
Hello Andrey Zakharoff:
No, I didn't use the same dataset. Following a suggestion in one of the comments, I used MS-Celeb-1M instead. As far as I know, low resolution faces, low light, and profile (side) faces will affect accuracy. I've never seen anyone train specifically on profile faces only; most just convert profile faces into frontal faces. You can refer to the paper "GridFace: Face Rectification via Learning Local Homography Transformations"
Thank you, Tsai. The article is interesting, but too theoretical: there is no code published on GitHub.
In my photo database I have several faces of each of my colleagues in an individual folder (to be exact, the 128D feature vectors, stored as preprocessed binaries for speed). Once a face is detected from the camera, the program finds the one in the database with minimal distance. What do you think: if I precompute the average and variance over each folder, look for the minimal distance to the averages, and then verify that each of the 128 features is inside the [average +- variance*3] interval, will I achieve better accuracy? I am especially upset when somebody's recognition goes to the wrong individual's folder.
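The scheme proposed in this comment can be sketched as follows. This is only an illustration of the question, not a recommendation from the thread. All names are hypothetical, and one detail is an assumption on my part: the gate uses the per-dimension standard deviation (mean +- n_sigma * stddev) rather than the variance written in the comment, because the standard deviation has the same units as the features.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Descriptor = std::vector<double>;

// Per-person summary: mean and standard deviation of every feature.
struct PersonStats {
    Descriptor mean;
    Descriptor stddev;
};

// Compute the per-dimension mean and (population) standard deviation
// over all descriptors stored for one person.
PersonStats compute_stats(const std::vector<Descriptor>& samples) {
    assert(!samples.empty());
    const std::size_t dim = samples[0].size();
    PersonStats s{Descriptor(dim, 0.0), Descriptor(dim, 0.0)};
    for (const auto& d : samples) {
        assert(d.size() == dim);
        for (std::size_t k = 0; k < dim; ++k) s.mean[k] += d[k];
    }
    for (std::size_t k = 0; k < dim; ++k) s.mean[k] /= samples.size();
    for (const auto& d : samples) {
        for (std::size_t k = 0; k < dim; ++k) {
            const double diff = d[k] - s.mean[k];
            s.stddev[k] += diff * diff;
        }
    }
    for (std::size_t k = 0; k < dim; ++k)
        s.stddev[k] = std::sqrt(s.stddev[k] / samples.size());
    return s;
}

double euclidean_distance(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double diff = a[k] - b[k];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// True if every feature of the query lies inside
// [mean - n_sigma * stddev, mean + n_sigma * stddev].
bool within_gate(const PersonStats& s, const Descriptor& query, double n_sigma) {
    assert(query.size() == s.mean.size());
    for (std::size_t k = 0; k < query.size(); ++k) {
        if (std::abs(query[k] - s.mean[k]) > n_sigma * s.stddev[k]) return false;
    }
    return true;
}
```

A caller would pick the person whose mean is closest by euclidean_distance and then accept the match only if within_gate passes. Note that with only a handful of images per folder the standard deviation estimates are noisy, and a folder with a single image has zero spread, so the gate would reject almost everything.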
Dear Davis:
I finished training with the following results:
step#: 235857 learning rate: 0.0001 average loss: 0.0124465 steps without apparent progress: 13805
step#: 235984 learning rate: 0.0001 average loss: 0.0120583 steps without apparent progress: 13956
step#: 236112 learning rate: 0.0001 average loss: 0.0114196 steps without apparent progress: 14081
step#: 236239 learning rate: 0.0001 average loss: 0.0123198 steps without apparent progress: 14146
step#: 236366 learning rate: 0.0001 average loss: 0.0123314 steps without apparent progress: 14312
step#: 236494 learning rate: 0.0001 average loss: 0.0141606 steps without apparent progress: 14976
Saved state to face_metric_sync_
done training
get_distance_threshold: 0.6
get_margin: 0.04
num_right: 279
num_wrong: 21
This is 93% (279/300) accuracy on a 6M images dataset, however when I implement it on my 1:N face recognition application,
I get the following results on an image with 8 faces in the same frame:
(The score means "1.0-Euler distance" and Unknown_threshold was set to 0.81)
[28532] Cam:(0), face_0_ID(0), Name:(Unknown), score:(0.81)
[28532] Cam:(0), face_1_ID(0), Name:(Unknown), score:(0.78)
[28532] Cam:(0), face_2_ID(96), Name:(Will), score:(0.85)
[28532] Cam:(0), face_3_ID(58), Name:(Adam), score:(0.82)
[28532] Cam:(0), face_4_ID(11), Name:(Eva), score:(0.83)
[28532] Cam:(0), face_5_ID(0), Name:(Unknown), score:(0.79)
[28532] Cam:(0), face_6_ID(47), Name:(Sarah), score:(0.87)
[28532] Cam:(0), face_7_ID(29), Name:(Henry), score:(0.82)
As you can see, the unknown faces score is very near known faces,
which will increase FAR (False Acceptance Rate).
How can I increase the euler distance for different faces,
and what hyperparameters should I change in the dnn_metric_learning_on_images_ex.cpp to improve accuracy?
(I fiddled with the learning rates (initial 0.01 -> terminate at 0.0001) and iterations_without_progress_threshold (set to 15000) but without apparent improvement)
Thanks,
Joy
Dear Andrey Zakharoff:
As you can see in the comment above, I am facing a similar problem myself.
I used the same method as you: find the minimal distance in the database and give its label name to the face if the distance is smaller than a given threshold. This results in some false recognitions: unknown to known and known to other known. I read in this article that you can train a KNN classifier:
https://github.com/ageitgey/face_recognition/wiki/Face-Recognition-Accuracy-Problems
but I still think this is overkill if you can make the distance accurate via training in the first place.
Joy
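The matching method both commenters describe (take the closest enrolled descriptor, and fall back to "Unknown" when even the closest one is too far away) can be written down as a minimal sketch. The Enrolled struct, the identify function, and the max_distance parameter are hypothetical names chosen for illustration; the right threshold value depends on the model and should be found by experiment.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

using Descriptor = std::vector<double>;

// One enrolled face: a name and its descriptor (128D in the thread,
// but any fixed length works for this sketch).
struct Enrolled {
    std::string name;
    Descriptor descriptor;
};

double euclidean_distance(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double diff = a[k] - b[k];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// 1:N identification: return the name of the nearest enrolled descriptor,
// or "Unknown" if the nearest one is not closer than max_distance.
std::string identify(const std::vector<Enrolled>& database,
                     const Descriptor& query,
                     double max_distance) {
    double best = std::numeric_limits<double>::infinity();
    const Enrolled* best_match = nullptr;
    for (const auto& e : database) {
        const double d = euclidean_distance(e.descriptor, query);
        if (d < best) {
            best = d;
            best_match = &e;
        }
    }
    if (best_match == nullptr || best >= max_distance) return "Unknown";
    return best_match->name;
}
```

In the "score" terms used above (score = 1.0 - distance), an Unknown_threshold of 0.81 corresponds to max_distance = 0.19. Both error types mentioned in the comment are governed by this single number, which is why a threshold alone cannot reduce one without increasing the other.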
For real, how can I unsubscribe from this thread? I tried many times but new mails always pop up. It's been some months now.
@octf
1. Look further down this blog page and find the string
Follow-up comments will be sent to "your_email" Unsubscribe
2. Click the "Unsubscribe" link. Easy.
No that's what I've done. I also tried to unsubscribe from the email.
It just says you are already unsubscribed. Nothing really changes
Dear Tsai Joy, in the issue "Knn classifier #655" of "ageitgey/face_recognition" I wrote up my question about using the KNN method. If you are familiar with the topic, could you answer the question there?
@Tsai Joy
Thanks for your valuable insight into your results after training. I am experiencing the same behavior, that unknown faces score very close to known faces as well. I did not train the network myself but used the model provided by Davis.
Can you confirm this same behavior in your investigations or do you observe this only with the net trained by yourself?
Did you find a solution for a better separation of known and unknown faces?
There are always going to be failure cases with any model. But what you should also do is train something like a linear svm to recognize a specific person if you want to get better performance for that specific person.
@Davis, Hi Davis, how should this SVM model be trained: one against all, or the many in the database against the whole set of unknowns?
You train a binary linear svm to decide if it's the person of interest or not.
Hi Davis, do you expect a linear SVM to be advantageous for recognizing a large number of known faces, e.g. 10,000 faces, enrolled in the database? I was thinking of the SVM function of supervised learning to be an integral part of the neural net.
If you know who you want to recognize then it's almost certainly better to train a model like a linear SVM on top of the network output. That's true regardless of any other considerations.
Thanks Davis, could you point me to the right SVM trainer in dlib? My understanding is that I would need to look into svm_multiclass_linear_trainer, as it is not a binary decision.
Hello Davis,
You mentioned training your model with 7485 identities.
What I mean to ask is, could a similar model be trained with a low number of samples per individual identity, but still using a large number of images ( > 1 million)?
The dataset I work with has 2-6 pictures for each subject, and I wonder if it is enough to train a robust recognition model.
I understand this is more of a conceptual question than a technical one about the library, but I take your opinion on the matter could be very helpful.
Thanks!
It will probably not work as well if you only have between 2-6 instances of each identity.
Hi Davis, did you see my question regarding the recommended linear SVM classifier? Which one should I look into? We need to distinguish between a great number of known faces.
I would use http://dlib.net/ml.html#svm_c_linear_dcd_trainer. Train classifiers independently so you don't have to retrain everything when you add one more person. Whichever model gives the largest output wins when you use it.
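The decision rule in that reply (one independently trained binary classifier per person, largest output wins) can be sketched on the prediction side as follows. Training is left out: the weights and bias are assumed to come from some binary linear trainer. The PersonModel struct, the w·x + b form of the score, and the min_score rejection parameter are all assumptions made for this illustration, not dlib's API.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical container for one person's binary linear classifier.
// The weights and bias are assumed to come from a separately trained model.
struct PersonModel {
    std::string name;
    std::vector<double> weights;
    double bias;
};

// Linear decision value: positive means "looks like this person".
double score(const PersonModel& m, const std::vector<double>& x) {
    assert(m.weights.size() == x.size());
    double s = m.bias;
    for (std::size_t k = 0; k < x.size(); ++k) s += m.weights[k] * x[k];
    return s;
}

// Evaluate every person's classifier and return the name with the largest
// output. If no output exceeds min_score, report "Unknown" (this rejection
// step is an added assumption, not part of the reply above).
std::string classify(const std::vector<PersonModel>& models,
                     const std::vector<double>& x,
                     double min_score) {
    double best = -std::numeric_limits<double>::infinity();
    const PersonModel* winner = nullptr;
    for (const auto& m : models) {
        const double s = score(m, x);
        if (s > best) {
            best = s;
            winner = &m;
        }
    }
    if (winner == nullptr || best <= min_score) return "Unknown";
    return winner->name;
}
```

Because each PersonModel is independent, enrolling a new person only means training one more binary classifier and appending it to the vector. The existing models stay untouched, which is the point made in the reply.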
Dear Davis:
I used the svm_c_linear_dcd_trainer using m(128) with this result:
1 //George01 Same enroll image pushed to 5 samples as svm training label(+1)
0.323433 //Henry01
0.409045 //Henry02
0.425634 //George02
-0.339184 //Peter01
0.3545 //Henry03
0.350297 //Henry04
George, Henry, and Peter are different Asian people who look alike.
The results show that it can differentiate Peter01 (-0.339184) from the enrolled image George01.
Also, George02 (0.425634) does get the highest SVM score, matching the enrolled image George01.
However, in your original svm_c_linear_dcd_trainer results, the sinc points come out at nearly 1.0000
and the non-sinc points at nearly -1.0000. How can I get same faces to score +1.000 and different faces -1.0000?
I have no idea what you did so I can't say. I have a feeling you didn't check to see what value of C is right to use though. You must read the documentation if you want to know how to use things.
What's the estimated time to train a CNN for face detection using a CPU?
As an update to the svm_c_linear_dcd_trainer for FR:
I used disturb_colors from dlib to jitter the enrolled face
(because our application needs to enroll each person with only one image of that person)
and pushed all the color-disturbed faces into the vector of sample points used to train the svm_c_linear_dcd_trainer.
The results look much better than the euler score (1.0 - euler distance):
name: WILL, euler score: 0.6078, SVM score: 0.3353, cosine score: 0.9223
name: WILL, euler score: 0.8007, SVM score: 0.8076, cosine score: 0.9800
name: WILL, euler score: 0.6252, SVM score: 0.3715, cosine score: 0.9286
name: WILL, euler score: 0.7867, SVM score: 0.7762, cosine score: 0.9771
As you can see, the euler score used to range over "0.60~0.80"
and now the SVM score ranges over "0.33~0.80".
With this range we can set the confidence threshold at e.g. "0.50" and get fewer false alarms (lower FAR): the uncertain scores, e.g. "0.60" & "0.62", will now be labeled unknown instead of the name id "WILL".
However, large tests are still needed to verify the FAR. I will test on larger enrollment databases and update when done testing.
Thanks Davis for all the help!
@Tsai Joy, good morning!
You are right, the most important task is to lower false positives for unknown faces.
There are some things I don't understand:
1) Do you mean Euclidean distance when you say "euler distance"?
2) Is it good that for images of the same person, "WILL", you get such a high variance in the SVM score?
It would be better if we got a much higher score for the right person ("WILL") and a lower score for a wrong (unknown) person (not "WILL", not "JOHN", etc).
Could you clarify, please?
Hi Andrey, yes, it's the Euclidean score, not euler, my bad ;)
(Euclidean score = 1.0 - Euclidean distance)
As for the second question, yes, there were uncertain regions (Euclidean score 60~70) where the face recognition (FR) had trouble giving correct results. Therefore, I previously skipped these frames completely, since the input is a video and I have many frames of face images to use.
In short, a Euclidean score below 60 was treated as "not confident" and the face was labeled "Unknown" (Unknown means not enrolled in the database). In contrast, a Euclidean score above 70 was labeled with the enrolled name in the database, e.g. "WILL". For Euclidean scores between 60~70, false results occur, i.e. the FR will think it's someone else, "John", when it's actually "WILL".
But all these measures are just a workaround for the Euclidean score, as the SVM score can now give a more definite "-1" for different faces and "+1" for same faces between the input and the enrolled image. I guess you can say that a high variance SVM score (I normalized it from -1~1 to 0.00~1.00) is better in an FR application, because it separates the "higher confidence known" faces from the "lower confidence unknown" ones.
@Tsai Joy Hi Joy, I use the formula Probability=sqrt(1.- Euclidean distance). This is not a real probability, just a probability-like value. In this way I de-linearize the output value. Due to the changed slope of the function I get stretched scores near 0. and shrunk ones where the Euclidean distance approaches 1. Don't you think your SVM does something like this?
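To make the two score mappings from this exchange concrete, here is a small sketch. The function names are hypothetical. The clamp at zero is an addition of mine: as noted earlier in the thread, the distance can be any number >= 0, so 1.0 - distance can go negative and the square root would then be undefined.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// "Euclidean score" as used in the thread: 1.0 - distance,
// clamped at 0 because distances above 1 are possible.
double linear_score(double distance) {
    assert(distance >= 0.0);
    return std::max(0.0, 1.0 - distance);
}

// The probability-like value from the comment above: sqrt(1 - distance).
double sqrt_score(double distance) {
    return std::sqrt(linear_score(distance));
}
```

Both mappings are monotonically decreasing in the distance, so they rank candidate matches in exactly the same order, and a threshold on either score is equivalent to some threshold on the distance itself. What changes is only the spacing: the slope of sqrt(1 - d) has magnitude 1/2 at d = 0 and grows without bound as d approaches 1. A per-person SVM differs in kind, because it learns a separate decision boundary for each person instead of rescaling one shared distance.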