Sunday, August 27, 2017

Vehicle Detection with Dlib 19.5


Dlib v19.5 is out and there are a lot of new features. There is a dlib to caffe converter, a bunch of new deep learning layer types, cuDNN v6 and v7 support, and a number of optimizations that make things run faster in different situations, like ARM NEON support, which makes HOG based detectors run a lot faster on mobile devices.

However, the coolest and most requested feature has been an upgrade to the CNN+MMOD object detector to support detecting things with varying aspect ratios. The previous version of the detector required the training data to consist of objects that all had essentially the same aspect ratio. This is fine for tasks like face detection and dog hipsterization, but obviously not as general as you would like.

So dlib v19.5 includes an updated version of the MMOD loss layer that can be used to learn an object detector from a dataset with any mixture of bounding box shapes and sizes. To demo this new feature, I used the new MMOD code to create a vehicle detector, which you can see running on these videos. This detector is trained to find cars moving with you in traffic, and therefore cars where the rear end of the vehicle is visible.




The detector is just as fast as previous versions of the CNN+MMOD detector. For instance, when I run it on my NVIDIA 1080ti I can process 39 frames per second when processing them individually and 93 frames per second when processing them grouped into batches. This assumes a frame size of 928x478.
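
To make the single-frame vs. batched distinction concrete, here is a minimal sketch of what the two calls look like with dlib's API. The network definition follows dlib's vehicle detection example program as best I can tell, and the model and image file names are illustrative; treat this as a sketch rather than a verbatim copy of the example.

    #include <dlib/dnn.h>
    #include <dlib/image_io.h>

    using namespace dlib;

    // Network definition as used by the vehicle detection example program.
    template <long num_filters, typename SUBNET> using con5d = con<num_filters,5,5,2,2,SUBNET>;
    template <long num_filters, typename SUBNET> using con5  = con<num_filters,5,5,1,1,SUBNET>;
    template <typename SUBNET> using downsampler = relu<affine<con5d<32, relu<affine<con5d<32, relu<affine<con5d<16,SUBNET>>>>>>>>>;
    template <typename SUBNET> using rcon5 = relu<affine<con5<55,SUBNET>>>;
    using net_type = loss_mmod<con<1,9,9,1,1,rcon5<rcon5<rcon5<downsampler<input_rgb_image_pyramid<pyramid_down<6>>>>>>>>;

    int main()
    {
        net_type net;
        deserialize("mmod_rear_end_vehicle_detector.dat") >> net;  // vehicle detector file (name illustrative)

        // Single frame: one call, one vector of detections.
        matrix<rgb_pixel> frame;
        load_image(frame, "frame_0000.jpg");                       // illustrative file name
        std::vector<mmod_rect> dets = net(frame);

        // Batched: hand the network several frames at once and get one vector of
        // detections back per frame.  Keeping the GPU busy this way is what gives
        // the higher frames-per-second number quoted above.
        std::vector<matrix<rgb_pixel>> frames(8, frame);           // pretend these are 8 different frames
        std::vector<std::vector<mmod_rect>> batched_dets = net(frames);
    }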

If you want to run this detector yourself you can check out the new example program that does just that. The detector was trained on a modest dataset of 2217 images, which is also available, as is the training code. Both these new example programs contain a lot of information about training this kind of detector and are worth reading if you want to understand the details involved. However, a short description here will give you a good sense of how the detector works.


Take this image as an example. I ran the new vehicle detector on it and plotted the resulting detections as red boxes. So what are the processing steps that go from the raw image to the 6 boxes?  To roughly summarize, they are:
  1. Create an image pyramid and pack the pyramid into one big image. Let's call this the "tiled pyramid".
  2. Run the tiled pyramid image through a CNN. The CNN outputs a new image where bright pixels in the output image indicate the presence of cars.
  3. Find pixels in the CNN's output image with a value > 0. Those locations are your preliminary car detections.
  4. Perform non-maximum suppression on the preliminary detections to produce the final output.
Steps 3 and 4 are pretty straightforward. It's the first two steps that are complicated. So to understand them, let's visualize the outputs of these first two steps. All step 1 does is call dlib::create_tiled_pyramid on the input image to produce this new image:
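
In code, that first step looks roughly like the following. This is only a sketch: it assumes the pyramid_down<6> downsampler used by the example network, the padding values and the image name are illustrative, and you should check dlib's image_transforms documentation for the exact signature before relying on it.

    #include <dlib/image_io.h>
    #include <dlib/image_transforms.h>
    #include <dlib/gui_widgets.h>

    using namespace dlib;

    int main()
    {
        matrix<rgb_pixel> img;
        load_image(img, "some_road_scene.jpg");   // any test image

        // Step 1: build the pyramid and pack it into one big "tiled pyramid" image.
        // The rects vector records where each pyramid level landed inside tiled_img,
        // which is what lets detections be mapped back to original image coordinates.
        matrix<rgb_pixel> tiled_img;
        std::vector<rectangle> rects;
        create_tiled_pyramid<pyramid_down<6>>(img, tiled_img, rects, 10, 10);

        image_window win(tiled_img, "tiled pyramid");
        win.wait_until_closed();
    }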


What's special about this image is that we don't need to worry about scale anymore. That is, suppose we have a detection algorithm that can find cars, but it only knows how to find cars of a certain size. No problem. When you run it on this tiled pyramid image you are going to find each car somewhere in it at the scale your detector expects. Moreover, the tiled pyramid is only about 3.7 times larger than the original image, so processing it instead of the raw image gives us full scale invariance for only a 3.7x increase in computational cost. That's a very reasonable trade. Moreover, tiling it inside a rectangular image makes it very easy to process using normal CNN tooling on a GPU and still get full GPU speeds. 
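
The 3.7x figure is easy to sanity check. If each pyramid level shrinks the image by a factor r in each dimension (the example network uses pyramid_down<6>, i.e. r = 5/6), then the total pyramid area relative to the original image is a geometric series:

    \sum_{k \ge 0} r^{2k} \;=\; \frac{1}{1 - r^{2}} \;=\; \frac{1}{1 - (5/6)^{2}} \;=\; \frac{36}{11} \;\approx\; 3.27

The padding dlib inserts between and around the tiles, plus some slack from packing the levels into a rectangle, presumably accounts for the rest of the roughly 3.7x.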

Now for step 2. The CNN takes the tiled pyramid as input, does a bunch of convolutions, and outputs a new set of images. In the case of our vehicle detector, it outputs 3 new images, each of which is a detection strength map that gets "hot" in locations likely to contain a vehicle. There are 3 images for the vehicle detector because the data contains, roughly, 3 different box aspect ratios (tall and skinny, e.g. semi trucks; short and wide, e.g. sedans; and squarish, e.g. SUVs). For purposes of display, I have combined the 3 images into one by taking the pointwise max of the 3 original images.  You can see this combined image below. The dark blue areas are places where the CNN is saying "definitely not a vehicle" and the bright red locations are the positions it thinks contain a vehicle.
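
If you want to reproduce this kind of visualization yourself, a sketch like the following should be close. It assumes net is a loaded detector of the net_type shown earlier, indexes the output tensor in dlib's (sample, channel, row, column) layout, and uses jet() just to colorize the float image; the example program does essentially the same thing, so consult it for the authoritative version.

    // Assumes the headers and net_type definition from the earlier sketch.
    #include <algorithm>

    void show_detection_strength(net_type& net, const matrix<rgb_pixel>& img)
    {
        net(img);  // run the detector so the final layer's output tensor is populated

        // Final layer output: one channel (k) per sliding window aspect ratio.
        const tensor& out = net.subnet().get_output();
        const float* data = out.host();

        // Pointwise max over the channels -> one combined detection strength map.
        matrix<float> combined(out.nr(), out.nc());
        combined = -1e30f;
        for (long k = 0; k < out.k(); ++k)
            for (long r = 0; r < out.nr(); ++r)
                for (long c = 0; c < out.nc(); ++c)
                    combined(r,c) = std::max(combined(r,c), data[(k*out.nr() + r)*out.nc() + c]);

        // Colorize: dark blue = "definitely not a vehicle", bright red = "vehicle here".
        image_window win(jet(combined), "CNN detection strength");
        win.wait_until_closed();
    }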


If we overlay this CNN output on top of the tiled pyramid you can see it's doing the right thing. The cars get bright red dots on them, right in the centers of the cars. Moreover, you can tell that the CNN is only detecting cars at a certain scale. The smaller cars are detected at the top of the pyramid and only as we progress down the pyramid does it begin to detect the larger cars.


After the CNN output is obtained, all the detection code needs to do is threshold the CNN output, find all the hot spots, apply non-max suppression, and output the boxes corresponding to the identified hot spots. And that's it, that's all the CNN+MMOD detector is doing.

On the other hand, describing how the CNN is trained is more complicated.  The code in dlib uses the usual stochastic gradient descent methods. You can see many of the details if you read the dlib DNN example programs.  How deep learning works in general is a big topic, but the most interesting thing here is the MMOD loss layer.  For the gory details on that I refer you to the MMOD paper, which explains the loss function.  In the paper the loss is discussed in the context of networks that are linear in their parameters, rather than non-linear in their parameters as our CNN is here. However, for understanding the loss, the difference between linear and non-linear is a minor detail. In fact, the loss equations are the same for both cases. The only difference is what kind of optimization algorithms are available for each case.  In the linear parameter case you can write a fancy numeric solver capable of solving the problem in a few minutes, but with a non-linear parameterization you have to resort to brute force SGD and GPUs running for many hours.
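
To make the training side slightly more concrete, here is a heavily abbreviated sketch of the setup, modeled on the dnn_mmod_train_find_cars_ex.cpp example. The window sizes, learning rates, batch size, and file names are illustrative rather than the exact values used for the published detector, and the real example adds data augmentation, ignore-box handling, and periodic evaluation that are omitted here.

    #include <dlib/data_io.h>   // for load_image_dataset
    // (plus the headers and the net_type definition from the earlier sketch; note that
    // the actual training example declares the network with bn_con layers where the
    // runtime version uses affine)

    int main()
    {
        std::vector<matrix<rgb_pixel>> images_train;
        std::vector<std::vector<mmod_rect>> boxes_train;
        load_image_dataset(images_train, boxes_train, "training.xml");

        // mmod_options clusters the training boxes into a small set of sliding window
        // shapes (3 aspect ratios for the vehicle data) and configures the MMOD loss.
        mmod_options options(boxes_train, 70, 30);   // target window size / min size, illustrative

        net_type net(options);
        // One output channel per detector window (aspect ratio):
        net.subnet().layer_details().set_num_filters(options.detector_windows.size());

        dnn_trainer<net_type> trainer(net, sgd(0.0001, 0.9));
        trainer.set_learning_rate(0.1);
        trainer.set_iterations_without_progress_threshold(50000);
        trainer.set_synchronization_file("mmod_cars_sync", std::chrono::minutes(5));

        // Mini-batches are random crops of the training images.
        random_cropper cropper;
        cropper.set_chip_dims(350, 350);

        std::vector<matrix<rgb_pixel>> mini_batch_samples;
        std::vector<std::vector<mmod_rect>> mini_batch_labels;
        while (trainer.get_learning_rate() >= 1e-5)
        {
            cropper(87, images_train, boxes_train, mini_batch_samples, mini_batch_labels);
            trainer.train_one_step(mini_batch_samples, mini_batch_labels);
        }

        trainer.get_net();   // wait for training threads to finish
        net.clean();
        serialize("my_vehicle_detector.dat") << net;
    }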

But at a very high level, it's running the entire detection process over and over during training, counting the number of detection mistakes (false alarms, missed detections, and duplicate detections), and back-propagating that error gradient through the CNN until the CNN stops messing up. Also, since the MMOD loss layer is counting mistakes after non-max suppression is applied, it knows that it needs to get the CNN to avoid producing high outputs in parts of the image that won't be suppressed by non-max suppression. This is why you see the dark blue areas of "definitely not a car" surrounding each of the car detections. The CNN has learned that it needs to be very careful on the border between "it's a car" and "it's not a car" to avoid accidentally detecting the same car multiple times. 

This is perhaps easiest to see if we merge the pyramid layers back into the original image. If we make an image where the pixel value is the max over all scales in the pyramid we get this image:


Here you can clearly see the 6 car hotspots and the dark blue areas of "not a car" immediately surrounding them. Finally, overlaying this on the original image gives this wonderful image:




50 comments:

  1. Hi Davis,
    the updated CNN+MMOD detector that supports detecting objects with varying aspect ratios IS REALLY COOL! My dataset has objects with different aspect ratios, and I will definitely try the new detector.
    Thanks for this great feature!

  2. How do you generate bounding boxes from the heatmap?

  3. The boxes are centered on the bright spots. The scale (i.e. size) of a box is determined by which level of the pyramid contained the bright spot. The aspect ratio is determined by which of the heatmaps contained the spot in the first place. Recall that the CNN outputs multiple heatmaps, one for each possible aspect ratio.

  4. Got it, yeah, thanks. Great work one more time. Thinking of a python binding for this?

  5. Thanks. Yeah I might make a python binding. We will see.

  6. Great job Davis! Now if only you could make this detect objects of multiple classes in a single pass, that would be the definitive version of MMOD+CNN :)

  7. Yep, that's the next thing I'm doing :)

  8. Hey Davis,
    Great work on the vehicle detection. Would appreciate it if you could provide sample code for vehicle detection in python.

  9. Greetings Mr. King,

    Thanks a lot for your work and making it available to others.
    I have some questions about your CNN implementations (I'm new to CNN, so probably some silly ones):

    How is it possible for your CNN to take an arbitrarily sized input image for processing, when, AFAIK, for example for AlexNet the input resolution has to be fixed?
    I also don't quite understand how the output layer returns the coordinates of a detection.

    Could you please point me to code/docs/articles to start digging this topics ?
    Regards,
    Pavel.

  10. Nothing about CNNs requires a fixed-size input so long as there aren't any "fully connected layers" in them. Anything you read about CNNs should make this clear.

    As for how to get the output coordinates. I'm not sure I can explain it any more simply than what I've already said. You can see the output image from the CNN in the blog post. That image plainly contains bright spots. Those bright spots are where the cars are located.

  11. Thanks a lot for your response. My thoughts on varying image sizes are more or less summarized in the first answer here: https://stats.stackexchange.com/questions/188165/lenet-limitation-on-input-size/188166

    That answer states that : "...However, often it is easy to adjust the first layer to make the network (in principle) work with different sized input..." with no further explanation.

    I'm a bit stuck at this point, hence my question.

    Regards,
    Pavel.

  12. Great job as usual. I was wondering how I could enable ARM NEON support for HOG based detectors?

  13. Thanks. NEON support is just one of the cmake options you can toggle on or off when you compile.

  14. Oops, that's not right. It's not a cmake toggle. For that you put -mfpu=neon as a compiler flag, just like you normally would, and dlib will automatically use NEON instructions.

  15. This comment has been removed by the author.

  16. This comment has been removed by the author.

  17. I understand the aspect ratio point for creating a pyramid, but how does this apply to the cnn face detector? I assume both use an image pyramid, but you only illustrated this fully in the (later) car detection example?

    And because of the different aspect ratios, you would have to use a two-step process, one a detector and the other a shape predictor (https://github.com/davisking/dlib/blob/master/examples/dnn_mmod_find_cars_ex.cpp). But I don't see the shape predictor in the face detection example (https://github.com/davisking/dlib/blob/master/examples/dnn_mmod_face_detection_ex.cpp). The net definitions are exactly the same, so how does that work?

  18. None of these things need a shape_predictor. That's only in the example because it makes the bounding boxes look a little nicer and is a nice easy technique to know about. It has nothing to do with different aspect ratios or anything like that. The pyramid also doesn't have anything to do with the box aspect ratios.

    All these networks do is output these detection strength maps. This is true of the face detection CNN, the dog head one, and the cars one. There is a detection strength map for every aspect ratio and whichever is hottest decides the aspect ratio. The face detector only outputs one detection strength map because all its boxes are square (and it was made before the code supported multiple aspect ratios anyway). The cars detector outputs multiple maps because there are multiple car aspect ratios.

  19. I am interested in converting a dlib model to caffe. I am trying to migrate the dnn mmod face detector to caffe:
    1. A caffe layer to build the image pyramid. Does this image pyramid have to be exactly the same as the one dlib builds? I mean the image arrangement, or something else?
    2. What else is needed for the migration?

    Thanks

  20. That should be all you need and it probably matters a lot that you do it the same way dlib does.

  21. You also have to ask yourself why you would want to do this in the first place. The only legitimate reason I'm aware of to use the dlib to caffe converter is to run some DNN tool only available for caffe, like some of the DNN visualization tools. If you just want to run the detector then just call dlib.

  22. Dear Davis:
    I'm training a model to detect capacitors on an IC motherboard. I tested the code on a few training images (i.e. 9 images for training.xml and 4 for testing.xml).
    However I get a "bad allocation" error after just the first step# in the optimization:

    num_overlapped_ignored: 0
    num_additional_ignored: 1
    num_overlapped_ignored_test: 0
    num training images: 9
    num testing images: 4
    dnn_trainer details:
    net_type::num_layers: 21
    net size: 0.0067606MB
    net architecture hash: 53d6dea8baae770fc4ed0b8ed8c88dcd
    loss: loss_mmod (detector_windows:(68x70,67x70), loss per FA:1, loss per miss:1, truth match IOU thresh:0.5, overlaps_nms:(0.1,0.1), overlaps_ignore:(0.5,0.95))
    synchronization file: mmod_cars_sync
    trainer.get_solvers()[0]: sgd: weight_decay=0.0001, momentum=0.9
    learning rate: 0.1
    learning rate shrink factor: 0.1
    min learning rate: 1e-05
    iterations without progress threshold: 50000
    test iterations without progress threshold: 1000
    random_cropper details:
    chip_dims.rows: 350
    chip_dims.cols: 350
    randomly_flip: true
    max_rotation_degrees: 2
    min_object_size: 0.2
    max_object_size: 0.7
    background_crops_fraction: 0.5
    translate_amount: 0.1

    step#: 1 learning rate: 0.1 average loss: 0 steps without apparent progress: 0
    bad allocation

    My training images are 1400x1600 pixels and the capacitors I want to detect are about 50x50 to 100x100 pixels. What should I change to overcome the bad allocation error?

  23. Make the mini-batches smaller, either in number of images or the size of each image in the mini-batch.
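
    For anyone hitting the same error, the two knobs in the training example are roughly these (values are illustrative):

        // Fewer crops per mini-batch...
        cropper(10, images_train, boxes_train, mini_batch_samples, mini_batch_labels);

        // ...and/or smaller crops, so each image in the mini-batch uses less GPU memory.
        cropper.set_chip_dims(200, 200);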

  24. Thanks Davis, it worked after I changed the cropper batch size from 87 to 10.
    I've been training since yesterday and the loss has gone from 50.779 to 2.57221 so far.
    Thanks again for the fast reply and all the hard work put into dlib. I'll post the detector's results in the comments after training is done.

  25. Hi, when I run the dnn_mmod_find_cars_ex example I get the error "Error while calling cudaOccupancyMaxPotentialBlockSize(&num_blocks,&num_threads,K) in file /home/elian/Documents/app/dlib/dlib/dnn/cuda_utils.h:155. code: 8, reason: invalid device function".
    I suspect it's because of the low capacity of my graphics card (it's an Nvidia with 2GB), but maybe I'm wrong. What do you think is the reason? Thanks.

  26. Yes, you probably need a newer GPU.

  27. So I recently purchased a new graphics card (i.e. Nvidia's 1080 Ti) and I have some questions when compiling the build in cmake.

    1. I installed CUDA 9.0 with the latest cudnn-9.0-v7, copying its bin, include, and lib folders into the CUDA 9.0 folder (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0). Is this version of CUDA compatible with dlib 19.7?

    2. When using cmake I'm not sure which folder I should give for the CUDA_DIR value. I tried the CUDA 9.0 folder (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0) but in vain. Which directory should I give?

    The error I get in cmake is as follows:
    C++11 activated.
    Enabling AVX instructions
    CMake Deprecation Warning at C:/dlib-19.7/dlib/CMakeLists.txt:31 (cmake_policy):
    The OLD behavior for policy CMP0023 will be removed from a future version
    of CMake.

    The cmake-policies(7) manual explains that the OLD behaviors of all
    policies are deprecated and that a policy should be set to OLD only under
    specific short-term circumstances. Projects should be ported to the NEW
    behavior and not rely on setting a policy to OLD.


    CMake Warning at C:/dlib-19.7/dlib/CMakeLists.txt:513 (find_package):
    By not providing "FindCUDA.cmake" in CMAKE_MODULE_PATH this project has
    asked CMake to find a package configuration file provided by "CUDA", but
    CMake did not find one.

    Could not find a package configuration file provided by "CUDA" (requested
    version 7.5) with any of the following names:

    CUDAConfig.cmake
    cuda-config.cmake

    Add the installation prefix of "CUDA" to CMAKE_PREFIX_PATH or set
    "CUDA_DIR" to a directory containing one of the above files. If "CUDA"
    provides a separate development package or SDK, be sure it has been
    installed.


    *** cuDNN V5.0 OR GREATER NOT FOUND. DLIB WILL NOT USE CUDA. ***
    *** If you have cuDNN then set CMAKE_PREFIX_PATH to include cuDNN's folder.
    OpenCV not found, so we won't build the webcam_face_pose_ex example.
    Configuring done

  28. Normally when CUDA is installed CMake will find it without any special setup. Maybe CUDA 9 is too new for the version of CMake you have. I would try using CUDA 8 or possibly getting the newest CMake if that isn't already what you have.

  29. Thanks Davis for the advice; after downgrading to CUDA 8 and using the latest CMake I was able to find CUDA with CMake and compile dlib!
    However I get link errors in VS2015 with dlib.lib (the 55 errors I get fall into these 3 categories):
    LNK2001 unresolved external symbol cudaXXXX dlib.lib(cublas_dlibapi.obj)
    LNK2001 unresolved external symbol cudaXXXX dlib.lib(dlib_generated_cuda_dlib.cu.obj)
    LNK2001 unresolved external symbol cudaXXX dlib.lib(dlib_generated_cusolver_dlibapi.cu.obj)

    My Additional Library Directories are:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64;C:\dlib-19.7\dlib\external\libjpeg;C:\dlib-19.7\examples\build\dlib_build\Release;%(AdditionalLibraryDirectories)

    And Input Additional Dependencies are:
    dlib.lib;cudnn.lib;
    How should I resolve these link errors? Thanks!

  30. Use cmake to configure your project, it will add the appropriate flags and CUDA runtime libraries. Here is an example http://dlib.net/examples/CMakeLists.txt.html

  31. Sorry for the delayed reply. Caffe is the only DL framework that uses plain CUDA code so far. cuDNN is a black box and cannot be tuned by hand, so we prefer caffe. Also, in the future we have to consider moving to a new platform other than CUDA, OpenCL for example. So do you have any other suggestions? Thanks

  32. This comment has been removed by the author.

  33. Hey Davis,

    I am testing your vehicle detector but I would like to know if it could work with grayscale images. I am training a model, but it has taken approximately 27000 steps and the program does not end. Is this normal? The learning rate is always 0.1, although the training loss does decrease and is close to 0.02. I have also tried to finish the training program by changing the condition of the loop, and it correctly generated the .dat, but it gives me an error when loading it in the "Deep Learning Vehicle Detection" program -> "An error occurred while trying to read the second object from the file test.dat ERROR: no more objects were in the file!"

    Thank you very much for your attention.

  34. It should work fine for grayscale images.

    Yes, training takes a long time. Use a GPU and be patient.

    That example program doesn't make a shape_predictor. Just edit the other example you are using to not use the shape_predictor since you didn't train one.

  35. finally finished the training!

    For grayscale images, should something be modified in the dnn_mmod_train_find_cars_ex.cpp code to make it work better?

    To create the shape_predictor I think it is necessary to use the example train_shape_predictor_ex.cpp, but it is written to load an .xml with landmarks and is specifically designed for faces. Which example should I use, or what changes should I make to this example, to train a shape_predictor for vehicles? Because with only the network it does not detect the vehicles.

  36. It doesn't matter if they are grayscale.

    train_shape_predictor_ex.cpp is a tutorial that explains how to use the shape predictor. Think of it like an essay. Read the whole thing to understand how to use the tool, then apply the tool to whatever you want. There isn't anything face specific about it, it's just that the essay uses faces as an example.

  37. Great article, thank you! Do you know, or can you at least estimate to an order of magnitude, what the detection performance (fps) is on a standard CPU? Thank you!

  38. A CPU will be something like 20x slower than a GPU in general.

  39. Davis, as always, thank you for the great job! But does this type of detector support some kind of confidence thresholding? I mean, can I somehow control what level of the network's output counts as an actual detection? For instance, in the SSD architecture it is a straightforward parameter.

  40. Yes, the output contains a detection_confidence field. All of these things are discussed in the documentation at length.
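
    For example, filtering the detections looks roughly like this (the threshold value is arbitrary):

        for (auto&& d : net(img))
        {
            if (d.detection_confidence > 0.5)   // keep only confident detections
                win.add_overlay(d.rect, rgb_pixel(255,0,0), d.label);
        }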

  41. Hi:
    Detection speed is very slow.

    We ran your car detection program (dnn_mmod_find_cars_ex.cpp and dnn_mmod_train_find_cars_ex2.cpp) on a GPU (GTX 1080) platform, and defined the macro DLIB_USE_CUDA as well. The CUDA version is 9.0, the operating system is Win7, and the size of the selected image is 720x912. We would like to know why the following problems occurred:
    1. The detection speed was very slow; the average time was 850 milliseconds.
    2. Only some of the vehicles were identified.
    3. How can we modify the net parameter settings to get a faster detection speed?

    The code is shown below:
    #include <iostream>
    #include <windows.h>
    #include "../dlib/dnn.h"
    #include "../dlib/image_io.h"
    #include "../dlib/gui_widgets.h"
    #include "../dlib/image_processing.h"

    using namespace std;
    using namespace dlib;

    #pragma comment(lib, "cudnn.lib")
    #pragma comment(lib, "dlib-md-mkl-gpu-x64.lib")
    #pragma comment(lib, "cudart.lib")
    #pragma comment(lib, "cusolver.lib")
    #pragma comment(lib, "cublas.lib")
    #pragma comment(lib, "curand.lib")

    // The front and rear view vehicle detector network
    template <long num_filters, typename SUBNET> using con5d = con<num_filters,5,5,2,2,SUBNET>;
    template <long num_filters, typename SUBNET> using con5  = con<num_filters,5,5,1,1,SUBNET>;
    template <typename SUBNET> using downsampler = relu<affine<con5d<32, relu<affine<con5d<32, relu<affine<con5d<16,SUBNET>>>>>>>>>;
    template <typename SUBNET> using rcon5 = relu<affine<con5<55,SUBNET>>>;
    using net_type = loss_mmod<con<1,9,9,1,1,rcon5<rcon5<rcon5<downsampler<input_rgb_image_pyramid<pyramid_down<6>>>>>>>>;

    // ----------------------------------------------------------------------------------------

    int main() try
    {
        net_type net;
        shape_predictor sp;
        deserialize("mmod_front_and_rear_end_vehicle_detector.dat") >> net >> sp;
        matrix<rgb_pixel> img;
        load_image(img, "./img1.jpg");

        image_window win;
        win.set_image(img);
        DWORD dstart = GetTickCount();
        // Run the detector on the image and show us the output.
        for (auto&& d : net(img))
        {
            // We use a shape_predictor to refine the exact shape and location of the detection
            // box. This shape_predictor is trained to simply output the 4 corner points of
            // the box. So all we do is make a rectangle that tightly contains those 4 points
            // and that rectangle is our refined detection position.
            auto fd = sp(img, d);
            rectangle rect;
            for (unsigned long j = 0; j < fd.num_parts(); ++j)
                rect += fd.part(j);

            if (d.label == "rear")
                win.add_overlay(rect, rgb_pixel(255, 0, 0), d.label);
            else
                win.add_overlay(rect, rgb_pixel(255, 255, 0), d.label);
        }
        save_jpeg(img, "E:\\Res.jpg");
        cout << "Tick Times " << GetTickCount() - dstart << endl;
        cout << "Hit enter to end program" << endl;
        cin.get();
    }
    catch (image_load_error& e)
    {
        cout << e.what() << endl;
        cout << "The test image is located in the examples folder. So you should run this program from a sub folder so that the relative path is correct." << endl;
    }
    catch (serialization_error& e)
    {
        cout << e.what() << endl;
        cout << "The correct model file can be obtained from: http://dlib.net/files/mmod_front_and_rear_end_vehicle_detector.dat.bz2" << endl;
    }
    catch (std::exception& e)
    {
        cout << e.what() << endl;
    }

  42. The cuda runtime takes a while to start up. You seem to be timing that. Also, maybe you aren't compiling it correctly. All those #pragma statements don't give me a lot of confidence that you are compiling things correctly. You should use cmake and follow the instructions here: http://dlib.net/compile.html

    Then run this example to see how fast it is:
    http://dlib.net/dnn_mmod_find_cars_ex.cpp.html

    Also: http://dlib.net/faq.html#Whyisdlibslow

  43. If you are detecting objects with three different sliding window aspect ratios, e.g. 30x30, 30x60, and 100x30, then are the three convolutional filters in the last layer 30x30, 30x60, and 100x30, respectively? I'm not sure I understand the connection between the dimensions of the filters in the last layer and the dimensions of the sliding windows.

  44. Yes, the last layer has 3 filters that output 3 different detection channels. The sizes of the filters are all the same though.

  45. Super nice and clear explanation! I was wondering if you could tell us how to generate those heat map images :)

  46. The example program discussed in this blog post (http://dlib.net/dnn_mmod_find_cars_ex.cpp.html) makes the heat maps.

  47. Oh, how could I miss that :( Thanks!

  48. Yeah, you can do whatever you want with it.
