Saturday, September 23, 2017

Fast Multiclass Object Detection in Dlib 19.7

The new version of dlib is out and the biggest new feature is the ability to train multiclass object detectors with dlib's convolutional neural network tooling.  The previous version only allowed you to train single class detectors, but this release adds the option to create single CNN models that output multiple labels.  As an example, I created a small 894 image dataset where I annotated the fronts and rears of cars and used it to train a 2-class detector.  You can see the resulting detector running in this video:

If you want to run the car detector from this video on your own images you can check out this example program.

I've also improved the detector speed in dlib 19.7 by pushing more of the processing to the GPU. This makes the detector 2.5x faster.  For example, on the 928x478 image used in this example program, the detector ran at 39fps in the previous version of dlib but now runs at 98fps (on an NVIDIA 1080ti).
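
The 2.5x figure follows from the two frame rates quoted above. As a quick sanity check, here is a minimal Python sketch; the helper name is purely illustrative and not part of dlib:

```python
def speedup_and_latency(old_fps, new_fps):
    """Return (speedup factor, old ms per frame, new ms per frame)."""
    return new_fps / old_fps, 1000.0 / old_fps, 1000.0 / new_fps


# Frame rates quoted in the post for the 928x478 example image.
factor, old_ms, new_ms = speedup_and_latency(39, 98)
print(f"{factor:.2f}x faster: {old_ms:.1f} ms -> {new_ms:.1f} ms per frame")
```

In per-frame terms that is roughly 25.6 ms going down to roughly 10.2 ms.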

This release also includes a new 5-point face landmarking model that finds the corners of the eyes and bottom of nose:

Compared with the 68-point landmarking model included with dlib, this model is over 10x smaller: 8.8MB versus the 68-point model's 96MB.  It also runs faster and, even more importantly, works with the state-of-the-art CNN face detector in dlib as well as the older HOG face detector in dlib.  The central use-case of the 5-point model is to perform 2D face alignment for applications like face recognition.  In any of the dlib code that does face alignment, the new 5-point model is a drop-in replacement for the 68-point model and in fact is the new recommended model to use with dlib's face recognition tooling.
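
To make "2D face alignment" concrete, here is a deliberately simplified sketch that only estimates the in-plane tilt of a face from its eye-corner landmarks. It is not dlib's alignment code, which does more than recover a single rotation angle; the function names and the landmark coordinates are made up for illustration.

```python
import math


def center(points):
    """Average a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def eye_roll_degrees(left_eye_corners, right_eye_corners):
    """Angle of the line joining the two eye centers, in degrees.

    0 means the eyes are level; rotating the image by the negative of this
    angle about the face center would level them.
    """
    lx, ly = center(left_eye_corners)
    rx, ry = center(right_eye_corners)
    return math.degrees(math.atan2(ry - ly, rx - lx))


# Hypothetical landmark coordinates in image pixels (y grows downward).
left_eye = [(100, 120), (130, 124)]
right_eye = [(170, 130), (200, 134)]
angle = eye_roll_degrees(left_eye, right_eye)
print(f"eyes are tilted by {angle:.1f} degrees")
```

A full alignment would also normalize scale and position, which is where the nose point and the fixed layout of the five landmarks become useful.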


  1. Thanks to Davis King for the library. It helps me in my work.

  2. Great new stuff. You say that the "new 5-point model is a drop-in replacement for the 68-point model and in fact is the new recommended model to use with dlib's face recognition tooling." However, two questions:

    - Is it recommended because the results are better or just because it's faster/lightweight?

    - I know you say that it is a drop-in replacement, but does that mean that a face aligned with the 68-point model can be compared directly (distance between descriptors) to a face aligned with the 5-point model without fear of any issues?


  3. The results should in general be the same, but it's faster and smaller. The alignment should actually be slightly more accurate in general, but not by a lot. The real benefit is speed, size, and ability to use it with the CNN face detector in addition to the HOG detector.

    Yes, you can just replace the old shape model with the new model in any face recognition code that used the old one and it will work. I specifically made this new model to be a replacement for the old one. It will create the same kind of alignment as the old model and work with the previously trained face recognition model.
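
For readers wondering what "distance between descriptors" means in practice, here is a minimal sketch. It assumes descriptors are plain lists of floats compared by Euclidean distance; the 0.6 threshold is an assumption borrowed from dlib's face recognition example rather than from this post, and the toy vectors stand in for real 128-D descriptors.

```python
import math


def descriptor_distance(a, b):
    """Euclidean distance between two equal-length face descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_person(a, b, threshold=0.6):
    """True if the descriptors are closer than the threshold.

    The 0.6 default is an assumption taken from dlib's face recognition
    example, not from this post; tune it for your own data.
    """
    return descriptor_distance(a, b) < threshold


# Toy 4-D vectors standing in for real 128-D descriptors.
from_68_point = [0.10, -0.20, 0.05, 0.30]
from_5_point = [0.12, -0.18, 0.04, 0.31]
print(descriptor_distance(from_68_point, from_5_point))
print(same_person(from_68_point, from_5_point))
```

If both alignments really are interchangeable, descriptors of the same face computed either way should land well inside the threshold, which is easy to verify on a handful of your own images.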

  4. Hello Davis King,

    I was trying to compile the new release of dlib and I am running into some problems that I want to share with you.

    Compiling on Windows
    I used "dnn_face_recognition_ex.cpp" as test code. I had no problem compiling it using dlib-19.3 and dlib-19.4 in Visual Studio 2015 with cuda 8, but with dlib-19.7 I had the following errors:

    1) dlib.lib(gpu_data.obj) : error LNK2005: already defined "void __cdecl dlib::memcpy(class dlib::gpu_data &,class dlib::gpu_data const &)" (?memcpy@dlib@@YAXAEAVgpu_data@1@AEBV21@@Z) in dnn_face_recognition_ex.obj

    2) dlib.lib(gpu_data.obj) : error LNK2005: already defined "public: void __cdecl dlib::gpu_data::set_size(unsigned __int64)" (?set_size@gpu_data@dlib@@QEAAX_K@Z) in dnn_face_recognition_ex.obj

    I tried cudnn 5 and 7 (no difference). I also tried the CMakeLists.txt in the dlib folder from an older version that had worked correctly for me (other errors appeared).

    I was wondering if maybe we have to follow different steps in order to compile this new version, or maybe the minimum requirements of the required software have changed or maybe something happens with Policy CMP0007, because I had a warning that said it was not set.

    Compiling on Linux
    On Linux I had no problems compiling and running dlib-19.3 and 19.4 in the past. Now with dlib-19.7 the old #define DLIB_JPEG_SUPPORT problem is back. cmake runs successfully. I checked that DLIB_JPEG_SUPPORT was ON, that the CMakeLists code entered the JPEG FOUND branch, and that the libjpeg library was found, and all of that was right. The Release build also completes correctly. But when I ran the code it was unable to load jpeg images because of DLIB_JPEG_SUPPORT :( The only fix is to put a #define DLIB_JPEG_SUPPORT at the top of the cpp code.
    I was wondering whether something changed compared to previous releases. This is a bit strange to me because I had no problems with them.

    Sorry for this long and boring text and thank you very much for your time and effort :)

  5. Nothing has changed in how dlib is built. You must just be making some kind of mistake. Follow the instructions at the top of this page to compile the example programs: Read the example cmakelists.txt file.

  6. Thank you for your fast answer and for your time. At least now I know that nothing has changed. I'll keep trying. Regards!

  8. Hi Davis King, Can you give me some advice about system specification?

  9. Davis,
    Long time user - first time writer. Thanks very much for your code.
    We have built and used dlib in many situations (CPU and GPU) on many systems,
    We are running your classifier as serialized in the code,
    but on one particular Windows box, when we run face_detection (close enough to dnn_mmod_face_detection_ex), we get the following error:
    Error detected at line 682.
    Error detected in file e:\src\9.0-2017\_extrnheaders\dlib\dnn/loss.h.
    Error detected in function void __cdecl dlib::loss_mmod_::to_label,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::input_rgb_image_pyramid >,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,1,void>,classstd::vector >*>(const class dlib::tensor &,const class dlib::dimpl::subnet_wrapper,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::input_rgb_image_pyramid >,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,1,void> &,class std::vector > *,double) const.

    Failing expression was output_tensor.k() == (long)options.detector_windows.size().

    Any hints as to what could be a machine dependency here? This seems to me to be entirely software defined.
    BTW, we are definitely seeing the 2.5x speedup with the GPU - great job!

  10. Thanks, glad you like dlib :)

    This should definitely not happen and there shouldn't be anything machine specific in the code. If I had to guess I would check if there is something wrong with the GPU that is causing it to output empty tensors, which itself shouldn't happen, but maybe something is horribly wrong with CUDA on that machine.

  11. Thanks very much for the quick response. To help others: I got this message when somebody moved the training file away from the filename we were expecting. So we were trying to classify with an unloaded classifier. dlib was not at fault in any way.

  12. Yeah that's a problem :) You should have gotten an exception though when you tried to read the file.
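
One cheap way to avoid running an unloaded model is to validate the file path before handing it to any loader. The helper below is a hypothetical convenience, not a dlib function, and the file name is only an example.

```python
import os


def require_model_file(path):
    """Return path if it names an existing, non-empty file, else raise."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"model file not found: {path}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"model file is empty: {path}")
    return path


# Hypothetical usage: validate before handing the path to the loader.
try:
    require_model_file("mmod_human_face_detector.dat")
except (FileNotFoundError, ValueError) as err:
    print(f"refusing to run the detector: {err}")
```

Failing at this point gives a message that names the missing file, which is easier to act on than an assertion deep inside the network code.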

  13. hey Davis,

    When running cmake on dlib, is there any way to force it to look for cuda? In most cases dlib does not get built against cuda.

    many thanks

  14. CMake looks for cuda by default. There is nothing you need to do to make it look for it.

  15. Oh, you are right. Thank you for your great work.

  16. Mr. King,

    In your blog post you mentioned that you created a small 894 image dataset and annotated the fronts and rears of cars and used it to train a 2-class detector. Is that dataset available for download?

    I'm interested in taking advantage of the multiclass training and detection that you have implemented, in this iteration of dlib, in my own project.

  17. @Davis -- the new 5-point model looks very robust! Would you be able to elaborate on the shape_predictor_trainer params you used to achieve this?

  18. The exact program that created the model is archived here for posterity:

  19. To find out what dataset was used to create the car detector, read the example:

  20. Hi Davis King,
    I have been using dlib for the last year. Thank you for your great work.

    Currently I am using dlib 19.7 for face detection, calling
    dlib.cnn_face_detection_model_v1('mmod_human_face_detector.dat') through the Python interface.
    This algorithm gives better accuracy.
    I am using an NVIDIA Quadro P5000 GPU with 16 GB of RAM.

    I am sending frames in batches. Each frame is 2.2 MB (3072 X 2048).
    batchdets = detector(imglist, 1,batch_size = 3)
    Up to batch size = 3 it works fine. Once we increase the batch size to 4 it gives a cuda memory error.
    RuntimeError: Error while calling cudaMalloc(&data, new_size*sizeof(float))

    Can you suggest how we can increase the batch size to speed things up?

  21. You don't need to make the batch size bigger. You are already giving it a huge amount of work to do, so much that you are running out of GPU RAM.
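
If the goal is simply to get through a long list of frames without exceeding GPU memory, one option is to feed the detector fixed-size chunks. The generator below is a generic sketch; `batches` is a made-up helper, and the strings stand in for real image arrays.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder frame names; in the comment above these would be image arrays.
frames = [f"frame_{i}" for i in range(8)]
for chunk in batches(frames, 3):
    # A real loop would pass each chunk to the detector call quoted above.
    print(len(chunk), chunk[0])
```

This keeps peak memory bounded by the chunk size that is known to fit, here 3, regardless of how many frames are queued.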

  22. Hello Davis,
    Thanks for the wonderful work.

    Regarding the previous query by Mr. Amritanshu Sinha, it looks strange that dlib cannot detect faces in a batch of more than 3 frames (of 6 megapixels each) at one time on a GPU with a whopping 16GB of memory. Please clarify.

    Thanks for your time

  23. 3072 X 2048 is a big image. On top of that Amritanshu Sinha upsampled it so it's 4x bigger. I don't know what you expect. It's obviously going to use a lot of RAM to process such a huge image.
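
A rough back-of-the-envelope calculation shows why these frames are expensive. The sketch below counts only the bytes needed to hold the input as 32-bit floats, assuming each upsampling step doubles both width and height. It ignores the image pyramid and every intermediate layer output, so real usage will be considerably higher.

```python
def input_tensor_bytes(width, height, upsample_times=0, channels=3, bytes_per_value=4):
    """Bytes needed to hold one image as a float32 tensor after upsampling.

    Each upsampling step is assumed to double both width and height.
    """
    scale = 2 ** upsample_times
    return (width * scale) * (height * scale) * channels * bytes_per_value


MIB = 1024 * 1024
plain = input_tensor_bytes(3072, 2048)
upsampled = input_tensor_bytes(3072, 2048, upsample_times=1)
print(f"no upsampling: {plain / MIB:.0f} MiB per image")
print(f"upsampled once: {upsampled / MIB:.0f} MiB per image")
print(f"batch of 4, upsampled once: {4 * upsampled / MIB:.0f} MiB")
```

That is 72 MiB per raw frame, 288 MiB once upsampled, and over 1 GiB for a batch of four before the network has computed a single feature map.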

  24. Dear Davis,

    Thanks for the reply. We are trying to do face recognition on an outdoor 6MP camera with a 12mm lens. More megapixels mean clearer faces, and we need to process at least 15 fps. Can you suggest any workaround?


  25. Awesome! Man, I would so need an official Android port of this

  26. @Davis, thanks for the pointer to the training params for the 5 point model. I notice the image flipping you're doing in here as well. Is this trained with the same dataset as the 100MB model from the original implementation?

  27. No, it's an entirely unrelated dataset. I created this new 5 point dataset myself with dlib's imglab tool.

  28. Any details on size of dataset, image size, person and pose variation, etc? I'm always curious what you're using to get such robust performance of the models you create!

  29. It's all documented here:

  30. Hi Davis,

    great work you are doing!

    I have one question: I tried to train a dnn_mmod using another dataset with more than 2 classes but the training fails completely (1 0 0)
    I have a static camera and moving objects away and towards the camera. So the scale of the objects is changing - thought your pyramid input should help here - and also the aspect ratios. The trainer complains a bit about the aspect ratios...
    Nevertheless, do you have a quick tip for me how to target the problem?

    Thanks in advance!

  31. Thanks, glad you like dlib. See

  32. Hi Davis,

    Have you ever heard of FindFace? It's a system made by a Russian company, NTechLab, which allows users to instantly find people on the Russian social network VK. They have a database of 500 million photos and they can return extremely accurate results within 2-3 seconds.

    Here is how they work: they have trained a neural network to detect and output 300 facial features. They say that they have 1.5 kilobytes of data per face. Do you think DLib could do this some time in the future?

    Once they have the facial features they store them in a database which can then be easily accessed through indexes.

  33. I haven't heard of FindFace. But if you get your hands on 500 million faces you can certainly train a model using dlib from such data.

  34. Dear Davis:
    After training on 1080 ti with CUDA I get this error:

    Error while calling cudaMalloc(&data, n) in file C:\dlib-19.7\dlib\dnn\cuda_data_ptr.cpp:28. code: 2, reason: out of memory
    PS C:\dlib-19.7\examples>

    I have already changed cropper batch size from 87 to 20, and every 10 mini-batches do a testing mini-batch. What parameters do I need to change additionally to overcome this error?

    The following is the final output steps while training:

    step#: 67164 learning rate: 0.0001 train loss: 0.00216064 test loss: 0.00377144 steps without apparent progress: train=6135, test=989

    done training
    dnn_trainer details:
    net_type::num_layers: 21
    net size: 0.938306MB
    net architecture hash: 53d6dea8baae770fc4ed0b8ed8c88dcd
    loss: loss_mmod (detector_windows:(68x70,67x70), loss per FA:1, loss per miss:1, truth match IOU thresh:0.5, overlaps_nms:(0.1,0.1), overlaps_ignore:(0.5,0.95))
    synchronization file: mmod_cars_sync
    trainer.get_solvers()[0]: sgd: weight_decay=0.0001, momentum=0.9
    learning rate: 1e-05
    learning rate shrink factor: 0.1
    min learning rate: 1e-05
    iterations without progress threshold: 50000
    test iterations without progress threshold: 1000
    random_cropper details:
    chip_dims.rows: 350
    chip_dims.cols: 350
    randomly_flip: true
    max_rotation_degrees: 2
    min_object_size: 0.2
    max_object_size: 0.7
    background_crops_fraction: 0.5
    translate_amount: 0.1

    sync_filename: mmod_cars_sync
    num training images: 9
    training results: 1 0.555556 0.555556
    Error while calling cudaMalloc(&data, n) in file C:\dlib-19.7\dlib\dnn\cuda_data_ptr.cpp:28. code: 2, reason: out of memory
    PS C:\dlib-19.7\examples>

  35. As a follow up I commented out these lines:
    //upsample_image_dataset<pyramid_down<2>>(images_train, boxes_train, 1800*1800);
    //upsample_image_dataset<pyramid_down<2>>(images_test, boxes_test, 1800*1800);
    and the CUDA memory error below was solved:
    Error while calling cudaMalloc(&data, n) in file C:\dlib-19.7\dlib\dnn\cuda_data_ptr.cpp:28. code: 2, reason: out of memory
    PS C:\dlib-19.7\examples>
    FYI my largest image size is 1400x1600 training on 1080 ti. So I guess 1800x1800 is still too high for the limit.
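
It may help to spell out what the 1800*1800 argument does. The sketch below encodes one reading of upsample_image_dataset, stated here as an assumption: the argument is a pixel-count limit, and any image at or below it gets doubled in each dimension. The helper names are made up.

```python
def would_upsample(width, height, max_image_size):
    """Assumed rule: upsample only images with at most max_image_size pixels."""
    return width * height <= max_image_size


def size_after(width, height, max_image_size):
    """Image dimensions after the (assumed) 2x upsampling step."""
    if would_upsample(width, height, max_image_size):
        return width * 2, height * 2
    return width, height


limit = 1800 * 1800  # 3,240,000 pixels, the value passed in the example
w, h = size_after(1400, 1600, limit)
print(f"1400x1600 has {1400 * 1600:,} pixels, limit is {limit:,}")
print(f"after upsample_image_dataset: {w}x{h} = {w * h:,} pixels")
```

Under that reading a 1400x1600 image is below the limit, so it becomes 2800x3200 (about 9 megapixels), which would explain the out of memory error when the detector is then run on whole images.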

  36. Hi Davis,

    Thank you for your library. I am trying to apply the multiclass CNN detector to OCR. I've found that in some cases the orientation of the detector windows was changed. I suspect that the reason is a bug in the file loss.h, lines 432-433 (442-443):
    if (detector_width < min_target_size)
    detector_width = min_target_size;
    detector_height = min_target_size/ratio;

  37. Yes, the option setup code was a little bit wonky in 19.7. Use the latest dlib from github and it will do the right thing.

  38. Hi Davis,

    I am using dlib's "dlib_face_recognition_resnet_model_v1.dat" for feature extraction. We want to train the network further with another data set. Can you suggest a way to load the weights of the model "dlib_face_recognition_resnet_model_v1.dat" so that we can continue training from those initial weights?

    Thanks in advance..


  39. Hi Davis,
    I want to train an object keypoint detector instead of face landmarks. Accuracy is low for the moment. The bounding box of the detected object is not a square, and you say to use the find_affine_transform function in the shape_predictor.h file, but how can I do this with the Python module? Thanks.

  40. Hi Davis,
    Thank you for all your algorithms to the developers world.

    Have you already released the dataset you used to train the 5 point landmark detector?

  41. Yes, the data is available. See for links and more information about it.

  42. Hi Davis,
    Thank you for your great library.
    My question is: is there a way to use your "Fast Multiclass Object Detection" in Python?
    That is, can I load the model file "mmod_front_and_rear_end_vehicle_detector.dat" in Python to predict vehicle locations?
    If yes, is it possible to run it on the GPU?

  43. There are currently no python bindings for that part of dlib.

  44. Hi
    I wanted to test the GPU in dlib.
    Consider that I have two builds of dlib, one with Visual Studio 2015 which supports cuda and one with mingw53_32 that doesn't support cuda. I ran exactly the same code:

    frontal_face_detector detector = get_frontal_face_detector();
    cv::Mat temp = imread("img_1.jpg");
    cv_image<bgr_pixel> img(temp);
    TickMeter timer;
    std::vector<rectangle> faces = detector(img);

    and mingw530_32 gives 440.683 milliseconds and the Visual Studio compiler gives 543.65 milliseconds for face detection, even though cmake said dlib will use cuda when I built the Visual Studio version!
    I don't think it did use cuda because it is even slower.

    How can I be sure dlib uses cuda in the Visual Studio build?

  45. That part of dlib doesn't use cuda. But in general, the output of cmake will tell you in very explicit terms if dlib is compiled to use cuda. If it's compiled to use cuda then it uses cuda as much as it uses it. There is no configuration beyond "built with cuda" or "not built with cuda". So if cmake says it's using cuda then it's using cuda and you are getting whatever cuda acceleration you are going to get.

  46. Well, thank you for clarifying that. May I ask, will anet_type,loss_metric use cuda when creating the face vector?

  47. Yes, the deep learning tools use cuda.
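
For the Python bindings there is also a runtime check to complement reading the cmake output. The sketch below assumes the installed dlib exposes a DLIB_USE_CUDA attribute, and it falls back gracefully when dlib is missing or the attribute is absent; the function name is hypothetical.

```python
def dlib_cuda_status():
    """Return True/False for a CUDA/non-CUDA dlib build, or None if unknown."""
    try:
        import dlib  # optional dependency; absent on many machines
    except ImportError:
        return None
    # DLIB_USE_CUDA is assumed to be exposed by the installed dlib version.
    return getattr(dlib, "DLIB_USE_CUDA", None)


status = dlib_cuda_status()
if status is None:
    print("dlib is not installed, or this build does not report CUDA support")
else:
    print("dlib built with CUDA" if status else "dlib built without CUDA")
```

As noted above, a CUDA build only speeds up the deep learning tools; the HOG detector from get_frontal_face_detector runs on the CPU either way.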

  48. Hi Davis,
    Is there any relation between object detection and image quality: uncompressed and lossless (BMP), compressed but lossless (PNG), or compressed and lossy (JPEG), etc.?
    Thank you.

  49. Hello Davis,

    I am trying to train a landmark model using the Helen 194 points dataset. This dataset is annotated by the authors, so I simply generated the XML files using the original annotations. I used the default configuration of the algorithm, which is taken from the original paper. However, the results are not accurate. I have modified some parameters of the algorithm such as nu, oversampling and tree_depth, but the results are still not accurate. Any advice for improving my results? Thank you in advance.

    By the way, dlib is an awesome project.

  50. What does "not accurate" mean? That it's basically not working at all or that it's just a little bit less accurate than reported in the paper?

  51. I mean that it is less accurate than reported in the paper. For example, these images:

    Original result from paper

    The model I have trained obtains this result

    I suppose that the problem is in the training phase. Must any parameter be tuned? Any other ideas? Thank you in advance.

  52. It should be much better than that. I don't know what the problem is, but you aren't doing something right :)

  53. Ok. I will try to find what the problem is. Thank you.

  54. Hi Davis,

    What cases would cause this message "Not enough memory to handle tile data" on a GPU box?

    Thank you!

  55. Loading more data than will fit on your GPU.

  56. Currently looking at train_face_5point_model and the associated data-set dlib_faces_5points.tar... I notice that each file entry has two bounding boxes specified. Am I right in thinking that these are simply the two different bounding boxes detected by the CNN and the HOG detector? Otherwise, what do the two boxes represent? Thanks!

  57. Yes, that's what the boxes are.

  58. Hello Davis

    Nice to see the vehicle detector after face detector :) I would like to know, if we can train dlib to detect different classes - two wheeler, four wheeler, pedestrian.


  59. Yes, you can train whatever you want.

  60. Hi Davis,

    I am trying to do something similar to the sample program: dnn_mmod_face_detection_ex.cpp

    In the sample, the input to the CNN is a matrix object that is allocated on the host. In my code, prior to calling the net, I have some CUDA kernels that preprocess the image, so the image data is already on the GPU. Is there a way to invoke the CNN on the image data without first copying the image data back to the host?

    Also, is there a way to run the CNN in a specified CUDA stream (i.e. the stream I used to run my preprocessing kernels)?

    Thank you,

    Dalei Wang

  61. You could write a custom input layer that takes input from your other source, which shouldn't be a big deal. You can also just call one of the network's member functions that takes a tensor as input rather than a matrix.

    All the network computations run on the default CUDA stream. But you can just use per-thread default streams. Read the CUDA docs for details.

  62. Hi Davis,

    Thank you for your reply. Regarding your suggestion of passing a tensor to the network, my understanding is that the network is really an object of the dlib::loss_mmod templated class. dlib::loss_mmod is itself an alias of an instantiation of the dlib::add_loss_layer class. The dlib::add_loss_layer class has an operator() that takes a dlib::tensor as input, and that is the function you are referring to. Is my analysis correct?

    Thank you

  63. That's right. You could also pass a tensor to any of the other functions of the immediate sublayer, of which there are many options. But the operator() you mentioned is as good as any.

  64. Hi Davis,

    I just tried using the operator() we spoke about, but I am running into a snag.

    My prototype code, which works, looked like this:

    net_type net;

    //frame is a cv::Mat of type CV_8UC3

    dlib::cv_image<dlib::bgr_pixel> cvimg(frame);
    dlib::matrix<dlib::rgb_pixel> img;
    dlib::assign_image(img, cvimg);

    auto mmod_rectangles = net(img);

    I would like my productized code to look something like:

    unsigned char* devPtr = ...; // pointer to CUDA memory on GPU, where input image data already lives.

    dlib::resizable_tensor tensorInput;
    tensorInput.set_size(1, 3, h, w); //h, w are height and width of image

    myFancyConversionKernel<<<...>>>(tensorInput.device_write_only(), devPtr);

    //Here, myFancyConversionKernel is responsible for:
    // 1) Convert from uchar8 pixel data to float32 in the range of 0.0f to 255.0f.
    // 2) Deinterleave the RGB channels so that tensorInput contains data in planar format.
    std::vector<std::vector<dlib::mmod_rect>> mmod_rectangles;
    net(tensorInput, std::back_insert_iterator(mmod_rectangles));

    The new code builds and runs, but does not seem to produce any face bounding boxes (i.e. mmod_rectangles.size() == 1, mmod_rectangles[0].size() == 0).

    Is there anything that stands out as incorrect in what I am doing? I am uncertain about converting from uchar8 to float32, since in the prototype code, I did not perform any explicit conversion. I only did this conversion because it seems that dlib::resizable_tensor only supports float32 numerical format.

    Thank you,


  65. Look at the input layer's code. It's not just copying the data to the tensor. You have to replicate its behavior.
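
To illustrate the kind of work an input layer does beyond copying bytes, here is a small sketch of per-channel normalization and de-interleaving. The mean values and the division by 256 are assumptions based on a recollection of dlib's RGB input layers, so check them against the source of your dlib version. The sketch also does not reproduce the tiled image pyramid that the pyramid input layer builds, which a custom kernel would need as well.

```python
# Assumed defaults of dlib's RGB input layers; check your dlib version.
AVG_RED, AVG_GREEN, AVG_BLUE = 122.782, 117.001, 104.298


def interleaved_to_planar(pixels, width, height):
    """Convert interleaved uint8 RGB into three normalized float planes.

    pixels is a flat sequence [r0, g0, b0, r1, g1, b1, ...] in row-major
    order. Returns (red, green, blue), each a flat list of width*height floats.
    """
    if len(pixels) != width * height * 3:
        raise ValueError("pixel buffer does not match width*height*3")
    red = [(pixels[i] - AVG_RED) / 256.0 for i in range(0, len(pixels), 3)]
    green = [(pixels[i] - AVG_GREEN) / 256.0 for i in range(1, len(pixels), 3)]
    blue = [(pixels[i] - AVG_BLUE) / 256.0 for i in range(2, len(pixels), 3)]
    return red, green, blue


# A 2x1 image: one pure red pixel followed by one black pixel.
r, g, b = interleaved_to_planar([255, 0, 0, 0, 0, 0], width=2, height=1)
print(r, g, b)
```

Note that OpenCV frames of type CV_8UC3 are normally stored in BGR order, so a real conversion kernel would also have to swap the first and third channels.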

  66. I'm not sure what you are referring to as input layer. Is it

    line 308 of

    line 2714 of

    or something else?


  68. I used your model to run detection on the sample image you gave. It takes 6 seconds to process one image on the CPU. Is there any way to make it faster?

  69. Turn on compiler optimizations and link to the Intel MKL.

  70. Hi Davis,

    Can we use varying aspect ratios of bounding box for each label?

  71. Hi Davis!
    Could you share FAR FRR for dlib?

  72. Hey Davis,

    Does the 5 point landmark detection model implement the same algorithm as the 68 point landmark detection model, or is it based on something different?

  73. @sumit perhaps this will be helpful

  74. Hi @davis, by any chance does dlib 19.17 use threading to speed up face detection?

  75. I'm writing this here so other people can see the answer: I just compiled dlib with the cuda stuff and cuDNN, etc., and it built fine. Now, how do I pass my cv::cuda::GpuMat to dlib to get it to detect? Do I have to call download() on it first to get it into CPU memory? Would it be faster to use OpenCV's cuda-based HOG detector (because it does take a GpuMat)? Does anybody have timing comparisons for GPU-based face detection? I'm so confused about which might be better that I'm now looking on NVIDIA's site for their face-detection stuff.