Friday, October 7, 2016

Hipsterize Your Dog With Deep Learning

I'm getting ready to make the next dlib release, which should be out in a few days, and I thought I would point out a humorous new example program.  The dog hipsterizer!

It uses dlib's new deep learning tools to detect dogs looking at the camera. Then it uses the dlib shape predictor to identify the positions of the eyes, nose, and top of the head. From there it's trivial to make your dog hip with glasses and a mustache :)

This is what you get when you run the dog hipsterizer on this awesome image:
Barkhaus dogs looking fancy


Steven said...

Loving it!

Rob Siegel said...

You broke the mold on this one. You came up with a million-dollar app idea and gave it away for free. Does Baxter love it?

Davis King said...

Ha, making apps is boring. I have better things to do :) Baxter is conflicted as always though.

Ed Miller said...

Awesome! It turns out the Dog Hipsterizer works on bears too:


We'll probably end up using your pre-trained model as part of our project to identify the bears of Brooks Falls, Alaska. You can read more about it at our blog:

Davis King said...

Lol, awesome.

It sounds like you also want to do recognition. I've just added some deep learning tooling to dlib for that. You can see the introductory example program for it here: and a more advanced example here:

I've used that tooling to make a state-of-the-art face recognition model which I'll post online in a few days too. So it definitely works :)

Ed Miller said...

Yes, you are right, recognition is our aim. We've been roughly following the structure of FaceNet, and so far dlib has met all our needs. With this metric learning example, it looks like we can do the whole project using dlib. That will save us from having to fire up one of the more complicated neural net frameworks. Of course we have no idea if any of these networks will learn to differentiate between individual bears, but we're hopeful.

Thanks a million for dlib and for all the great examples!
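As an aside on the metric learning idea discussed above: the general approach trains an embedding so that same-identity pairs land close together and different-identity pairs land far apart. Here is a toy, stdlib-only Python sketch of a pairwise hinge loss in that style. The function names, the 0.6 threshold, and the margin are illustrative assumptions for this sketch, not dlib's actual loss_metric implementation.

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_hinge_loss(a, b, same_identity, threshold=0.6, margin=0.04):
    """Hinge-style pairwise loss: pull same-identity embeddings
    inside the threshold, push different-identity ones outside it.
    Loss is zero once a pair is on the correct side with margin."""
    d = dist(a, b)
    if same_identity:
        return max(0.0, d - (threshold - margin))
    return max(0.0, (threshold + margin) - d)

# Toy 3-D "embeddings" (a real model would output 128-D vectors).
bear1_a = [0.1, 0.2, 0.0]
bear1_b = [0.12, 0.19, 0.01]
bear2   = [0.9, -0.4, 0.3]

print(pair_hinge_loss(bear1_a, bear1_b, same_identity=True))   # → 0.0
print(pair_hinge_loss(bear1_a, bear2, same_identity=False))    # → 0.0
```

Training drives this loss toward zero over many labeled pairs; at that point a simple distance threshold decides whether two crops show the same individual.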

Ed Miller said...

Hi Davis,

For the dnn_metric_learning_on_images_ex.cpp example you are working on, do you expect the input images to be face crops where the face has been transformed to be centered? I looked at the johns directory on github, but I'm not 100% sure if any are transformed.

How many "measurements" are there in the example embedding?

What sort of hardware do you run on? I think I've seen a mention of Titan X on some of your examples.

Davis King said...

You could use the metric learning with anything, not just faces. But generally speaking, anything you can do to normalize out irrelevant changes in your inputs is always good. For faces that means aligning them to some standard pose since the pose is irrelevant to identity. The johns in the folder are obviously centered and cropped in the same way. I would in general try to normalize your data as much as possible.

I don't know what you mean by "measurements".

I use a titan x.

Ed Miller said...

Thanks for the feedback!

By "measurements" I mean the embedding dimensionality. For example, I believe the embedding for FaceNet was 128 floats per face. I see now that your FC layer is 128, so I guess that's the same.

Ed Miller said...

By the way, how long did it take to train the dnn_metric_learning_on_images_ex on your Titan X system and how big was the training set (or was it only the examples/johns)?

Davis King said...

Ah, yes, my model is also 128D.

Training took about 2 days and the training dataset is about 3 million images. I suspect you don't have 3 million images of bears, but you might be able to get by using the human face model, by fine-tuning the human face model with a smaller bear dataset, or even by bootstrapping a big dataset from nature videos of bears (and probably dog videos, since they are so similar) to train a bear face recognizer.

Also, the johns dataset is trivially small, too small for any practical purpose. It's just there to make the example program runnable and illustrate the API. This is true of all the example programs in dlib. Their purpose is to educate, not to be usable applications.

Ed Miller said...

You're right. We don't have 3 million images of known bears. Pulling them from videos might get us there, but I'll have to make sure we know which bears are in the videos. If it turns out the same individual bear showed up in different sets without our knowledge, it would throw off the training since we would be telling the network it's 2 different bears when in fact it is the same bear. Still, if I use videos from different geographies, I can probably assume they are different bears.

I hadn't thought to try the human face model directly. You never know. I was surprised by the Dog Hipsterizer's ability to work with bears! Although I think the human face embedding working for bears is less likely.

I think transfer learning with tuning is our best chance. I had been planning to use a CNN that was trained for ImageNet and replace the FC layer. Perhaps we will first try with the face model and fine tune.

Do you train the entire metric learning network from random using only faces? If so, starting with the ResNet-34 you trained for ImageNet may be a better starting point since it already has bears in the data set (and other animals).

If you don't want me to pollute your blog comments, you can contact me at ed at hypraptive dot com. :)

Thanks for all your help!

Davis King said...

Well, be prepared to spend a lot of time manually reviewing and fixing whatever dataset you make. Hopefully you can tell the difference between bears with your own eyes or it's going to be hard.

I trained the entire network from scratch in one shot.

I have some doubts that initializing this with an ImageNet trained network will help, but you never know until you try.

Davis King said...

Just posted the face recognition model and example program:

Unknown said...

Hi, Davis! I think I found an error in the file image_pyramid.h:

ptype temp = temp_img[r-2][c] +
             temp_img[r-1][c]*4 +
             temp_img[r  ][c]*6 +
             temp_img[r-1][c]*4 + // <--- must be r+1
             temp_img[r-2][c];    // <--- must be r+2

Thank you for your code!

Davis King said...

Oh yeah, good catch. Just fixed it.
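For readers following along, the fix amounts to restoring the symmetric 1-4-6-4-1 binomial smoothing kernel, which the copy-paste bug had broken by sampling r-1 and r-2 twice instead of reaching forward to r+1 and r+2. A pure-Python sketch of the corrected arithmetic (the function name and data here are illustrative, not dlib's actual code):

```python
def smooth_column(col, r):
    """Corrected 5-tap binomial smoothing (the 1-4-6-4-1 kernel)
    centered at index r of a single image column. Note the taps
    are symmetric: r-2, r-1, r, r+1, r+2."""
    return (col[r - 2]
            + col[r - 1] * 4
            + col[r] * 6
            + col[r + 1] * 4
            + col[r + 2])

# The taps sum to 16, so dividing by 16 preserves brightness on
# a flat region of the image.
flat = [10, 10, 10, 10, 10]
print(smooth_column(flat, 2) / 16)  # → 10.0
```

With the buggy indices, the kernel was lopsided, so smoothing would have shifted image content slightly instead of blurring it in place.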

krs_j2150 said...

Hi Davis,
We are working on something similar for our school project: an app where the user can try on spectacles in real time. We are using dlib for it, with the dog hipsterizer as reference code. We got it working but have some doubts. In the dog hipsterizer code, when drawing the glasses onto the image, you use a vector of "from" points. What do these represent? (I mean, are they points on the glasses image, like a left-side point and a right-side point, or some center points? We chose the values for the "from" vector by trial and error :p )

One more thing: the similarity transform (find_similarity_transform) is working fine for us, but can we do the same with a perspective transform? (For example, when the user turns his/her face, the shape of the glasses should change: smaller on the rotated side of the face and a little larger near the other eye.) Do we need to define something like a point of projection or a vanishing point?

Sorry if this comment does not make much sense; we are a little new to dlib. We all loved your work :) Awesome work :)

Davis King said...

Yes, the from points are points on the glasses image. Yes, if you want it to rotate like it would in 3D then you need to do some kind of projective transformation. I don't have a function that does exactly what you want in dlib. So you will have to work that out on your own.

Glad you like dlib though :)
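For anyone curious how the similarity transform works, here is a minimal, stdlib-only Python sketch that solves the exactly-determined two-point case using complex arithmetic, where a similarity transform is just z -> a*z + b. This is an illustration of the idea, not dlib's find_similarity_transform, which solves the general least-squares problem over any number of point pairs.

```python
def similarity_from_two_points(src, dst):
    """Solve for the 2-D similarity transform (uniform scale +
    rotation + translation) mapping two source points onto two
    destination points. Treat each (x, y) point as the complex
    number x + yi; then the transform is z -> a*z + b."""
    s0, s1 = complex(*src[0]), complex(*src[1])
    d0, d1 = complex(*dst[0]), complex(*dst[1])
    a = (d1 - d0) / (s1 - s0)   # encodes scale and rotation
    b = d0 - a * s0             # translation
    def apply(p):
        z = a * complex(*p) + b
        return (z.real, z.imag)
    return apply

# Map two "from" points on a glasses image onto detected eye
# landmarks; every other glasses pixel can be warped the same way.
warp = similarity_from_two_points(src=[(0, 0), (100, 0)],
                                  dst=[(30, 40), (80, 40)])
print(warp((50, 0)))  # → (55.0, 40.0), the midpoint between the eyes
```

A similarity transform keeps the glasses rigid (same shape, just scaled, rotated, and moved), which is why a true 3D head turn needs a projective transform instead, as noted above.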

Neon Bear Disco said...

Poor white poodle in the back - his eyes were obscured by his hair and couldn't get hipsterized!
Awesome work!

Davis King said...

Ha, thanks :)

Unknown said...

Hello, thank you for this post. I am using it as a reference for my scientific paper. Do you have a scientific paper I can cite?
If you don't, can you point me to some related papers for this post? Thank you very much.

Davis King said...

Thanks. Please use this, or this, depending on what you are doing.

Unknown said...

Hi, I'm trying to replicate something similar in Python. I loaded mmod_dog_hipsterizer.dat as a face detector model and it worked well. Now I'm trying to load mmod_dog_hipsterizer.dat as a shape predictor to locate the facial landmarks, but when using dlib.shape_predictor(predictor_path) to load mmod_dog_hipsterizer.dat I get the following error:
RuntimeError: Error deserializing object of type long
while deserializing a dlib::matrix

Am I missing something here?

Many thanks for the help.

Davis King said...

That's not how file loading works. Look at the code of the dog hipsterizer example program to see the details.

Unknown said...

I'm comparing the dog hipsterizer example program with face_landmark_detection_ex.cpp and its Python equivalent.

(Sorry I have virtually no C++ knowledge so it's mostly guessing work)

in face_landmark_detection_ex.cpp we have
// loading the model from the shape_predictor_68_face_landmarks.dat file you gave
// as a command line argument.
shape_predictor sp;
deserialize(argv[1]) >> sp;

and the Python equivalent seems to be
predictor_path = sys.argv[1]
predictor = dlib.shape_predictor(predictor_path)
(I played around with the code and loading shape_predictor_68_face_landmarks.dat this way seems to work)

So looking at dnn_mmod_dog_hipsterizer.cpp I assumed loading the shape predictor is done in
shape_predictor sp;
deserialize(argv[1]) >> net >> sp >> glasses >> mustache;

So I tried to load mmod_dog_hipsterizer.dat in the same way but I get that RuntimeError above.

I've no idea what the difference is between deserialize(argv[1]) >> sp; and deserialize(argv[1]) >> net >> sp; and perhaps this is where I went wrong? I'm guessing >> means output here, so am I right in thinking deserialize(argv[1]) >> net >> sp means:
deserialize mmod_dog_hipsterizer.dat and output it to the net object (equivalent to net = dlib.cnn_face_detection_model_v1(sys.argv[1])), then output net to sp? But doing sp = dlib.shape_predictor(net) also gives me errors, so I guess this is not the case.

Any help would be much much appreciated.

Unknown said...

In deserialize(argv[1]) >> net >> sp >> glasses >> mustache,
does deserialize(argv[1]) here return 4 objects, each of which gets saved to one of net, sp, etc.?

If so, which python dlib function should I use to load the mmod_dog_hipsterizer.dat file? I've tried most of the model loading classes, and the only one that seems to work is dlib.cnn_face_detection_model_v1. But it's not iterable and does not support indexing.

Davis King said...

There is no python API to do what you are trying to do. You need to write C++ code that does the right thing and call it from python.
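To see why loading the whole file with dlib.shape_predictor fails: mmod_dog_hipsterizer.dat stores several objects back-to-back in one stream, and the C++ line deserialize(argv[1]) >> net >> sp >> glasses >> mustache reads them out in order, one after another. The Python shape_predictor loader expects a file that starts with a shape predictor, but this file starts with the detector network, hence the deserialization error. A rough Python analogy using pickle (pickle is not dlib's serialization format; this only illustrates the sequential layout):

```python
import io
import pickle

# Serialize several objects back-to-back into one stream, the way
# the hipsterizer's .dat file chains net, sp, glasses, mustache.
buf = io.BytesIO()
for obj in ("net-weights", "shape-predictor", "glasses-png", "mustache-png"):
    pickle.dump(obj, buf)

# Reading works only in the same order: each load() consumes the
# next object from the stream, like chaining >> in C++.
buf.seek(0)
net = pickle.load(buf)  # must read the network first...
sp = pickle.load(buf)   # ...before the shape predictor is reachable
print(net, sp)  # → net-weights shape-predictor
```

Trying to interpret the very first bytes of such a file as the second object is exactly the kind of mismatch that produces a deserialization error.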
