Comparison of 7 image classification APIs for food pictures

David Urbansky
Apr 20, 2020

This article compares 7 online image recognition services in the context of food recognition. In particular, my goal was to find out which service is best suited to recognize and classify the dish you ordered in a restaurant based on a picture you took.

For context, I am the co-founder of the spoonacular recipe API, an online service all about food. We have recently built our own food image detection algorithm and this article is a product of our research into the competitive landscape.

There are plenty of services out there, but I decided to compare the following as they are leaders in the field and have stable APIs:

Amazon Rekognition

Image analysis by Amazon. They do not seem to have a pre-trained food model, so I used their generic tagger. Each classified image comes back with a number of tags and confidences.
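For illustration, a raw call to the generic tagger via the AWS SDK for Java looks roughly like this. The file name and thresholds are placeholders; for the actual tests I used Palladian's wrappers, mentioned at the end of the article.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.amazonaws.services.rekognition.AmazonRekognition;
import com.amazonaws.services.rekognition.AmazonRekognitionClientBuilder;
import com.amazonaws.services.rekognition.model.DetectLabelsRequest;
import com.amazonaws.services.rekognition.model.DetectLabelsResult;
import com.amazonaws.services.rekognition.model.Image;
import com.amazonaws.services.rekognition.model.Label;

public class RekognitionTagger {
    public static void main(String[] args) throws IOException {
        // "burger.jpg" is a placeholder; any of the dataset images works the same way.
        ByteBuffer imageBytes = ByteBuffer.wrap(Files.readAllBytes(Paths.get("burger.jpg")));

        // Credentials come from the default provider chain (env vars, ~/.aws/credentials, ...).
        AmazonRekognition client = AmazonRekognitionClientBuilder.defaultClient();

        DetectLabelsRequest request = new DetectLabelsRequest()
                .withImage(new Image().withBytes(imageBytes))
                .withMaxLabels(5)          // top 5 tags, matching the tables below
                .withMinConfidence(50f);   // cut off very uncertain tags

        DetectLabelsResult result = client.detectLabels(request);
        for (Label label : result.getLabels()) {
            System.out.printf("%s: %.1f%%%n", label.getName(), label.getConfidence());
        }
    }
}
```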

Clarifai

An image analysis service that also features a special food model, which I used for testing.

Google Cloud Vision

An image analysis service by Google that also does not come with a pre-trained food model.

Imagga

Another image analysis service without a pre-trained food model. I used their generic tagger.

Microsoft Computer Vision

An image analysis service or (as they call it) “cognitive service” by Microsoft. No food model available.

Watson Visual Recognition

IBM's Watson service for image analysis. IBM also offers a pre-trained food-specific model that I was able to use through their API.

spoonacular

The spoonacular food API offers a food-specific model that was trained on the test dataset itself (more on that caveat below).

Okay, now that we know our contenders, let’s take a look at what dataset we’re working with.

The 50-Class Food Dataset

Our goal is to build/test a food dish recognizer. That is, we don’t want to recognize single ingredients, such as an apple, milk, or a cup of mushrooms, but rather complex dishes that you would order in a restaurant.

To achieve this, we used the spoonacular food ontology to create a set of 518 dishes and gathered 2,781,306 images in total (over 417 GB in file size). The spoonacular food ontology is rather fine-grained, and many of the dish categories were a bit too specific (e.g. “lemon cookies”), so I reduced the set to 50 common classes with about 300 manually checked images per class.

The final dataset is 50 classes with a total of 15,742 images (4.4 GB in size).
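The manual checking obviously cannot be scripted, but the capping/sampling step can. Here is a minimal sketch, assuming a folder-per-class layout; the paths and the fixed seed are illustrative, not our actual pipeline.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class DatasetSampler {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("images");     // one sub-folder per dish class (assumed layout)
        Path target = Paths.get("dataset-50"); // destination for the reduced dataset
        int perClass = 300;                    // cap each class at ~300 images

        List<Path> classDirs;
        try (Stream<Path> stream = Files.list(source)) {
            classDirs = stream.filter(Files::isDirectory).collect(Collectors.toList());
        }

        for (Path classDir : classDirs) {
            List<Path> images;
            try (Stream<Path> files = Files.list(classDir)) {
                images = files.collect(Collectors.toList());
            }
            Collections.shuffle(images, new Random(42)); // fixed seed => reproducible sample
            Path outDir = Files.createDirectories(target.resolve(classDir.getFileName()));
            for (Path image : images.subList(0, Math.min(perClass, images.size()))) {
                Files.copy(image, outDir.resolve(image.getFileName()), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```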

The images are a mixture of high-quality, professional photographs (showing the perfect execution of the dish, usually NOT taken in a real restaurant setting) and “real world” images taken by people who actually ordered and received that dish (like pictures from the spoonacular food journal). The differences between the two types are often extreme, so I found it valuable to have both in the dataset.

Have a look at the following example classes, “cookies”, “burger”, and “pancakes”:

Image copyright Foodista and Unsplash.

Now, here is the full list of the 50 food categories with images (here in plain text):

50 example images for the 50 classes.

Comparison of Image Classification Services

Now that we know the dataset we’re working with, we can test the services with pictures from it. These tests can only give us a rough idea of how well each service works because, as stated earlier, not all of them have a food-specific model. More importantly, they are trained on a completely different taxonomy, which means they might not even know what “bibimbap” is or what “churros” look like.

The goal of this article, however, is to find out which services are well suited for real-world dish recognition without training your own models (as dataset preparation is the really hard part). In this context, it is fair to compare them against an unknown set of images and see what they think the images are.

I classified the same 50 images per class with each service, for 2,500 classifications per service in total. The super long image below shows the top 5 tags/categories that each service assigned to the images of each class. The percentage after a category is the share of images that received that particular tag. For example, Amazon Rekognition classified 98% of the “agnolotti” images as “Food”. Good start.

Additionally, I bolded the category names that we should consider correct for the given images. Since the other taxonomies are not exactly the same as spoonacular’s, we should still count “doughnut” as correct even if the spoonacular class is “donut”.
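In case you want to reproduce this kind of tally, the counting step is straightforward. A minimal sketch follows; the normalize mapping is illustrative, the real evaluation handled more variants than just “doughnut”/“donut”.

```java
import java.util.*;
import java.util.stream.*;

public class TagTally {

    // Fold obvious spelling variants into one label; illustrative mapping only.
    static String normalize(String tag) {
        return tag.toLowerCase(Locale.ROOT).replace("doughnut", "donut");
    }

    // tagsPerImage: one list of returned tags per classified image of a class.
    static void printTopTags(String dishClass, List<List<String>> tagsPerImage) {
        // Count each tag at most once per image, then aggregate over the class.
        Map<String, Long> counts = tagsPerImage.stream()
                .flatMap(tags -> tags.stream().map(TagTally::normalize).distinct())
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));

        // Print the 5 most frequent tags with the share of images that received them.
        System.out.println(dishClass + ":");
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(5)
                .forEach(e -> System.out.printf("  %s: %.0f%%%n",
                        e.getKey(), 100.0 * e.getValue() / tagsPerImage.size()));
    }
}
```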

The spoonacular column is just there for reference — since spoonacular’s classifier was trained using the dataset, the category names always match and the percentage of matches is of course often higher.

Final evaluation table.

The last row in the table shows how many classes in the top 5 can be considered correct. Aside from the expected 50/50 for spoonacular, we can see that the two services with special food-related classifiers, Clarifai and Watson, outperformed the other services dramatically.

In particular, the Watson food classifier seems to have been trained on a more fine-grained taxonomy. This is not surprising, since they say they have 2,000 tags, ranging from specific dishes to broader categories like “sweet” and “delicacy”. They even differentiate between “barbecued wing” and “buffalo wing”!

To give you an idea which tags/categories each service assigned to the provided images, here are the top 50 tags for each service (for all 2,500 classified images). You can also download the raw data if you’re interested in seeing it all.

Top 50 Classes for Amazon Rekognition

Amazon answered with 1,029 different tags, which is to be expected for a general classifier. The funniest tags were “T-Rex”, “dynamite”, and “toilet” :)

Top 50 for Amazon Rekognition.

Top 50 Classes for Clarifai

Clarifai answered with a total of 740 different food-specific tags (remember, they have a food-specific model). Looking at the low-frequency tags, we can see that their model doesn’t only cover dishes but also plain ingredients such as “starfruit” and “watercress”, and even spices like “cumin”.

Top 50 for Clarifai.

Top 50 Classes for Google Cloud Vision

Google only had a generic model, which shows in the poor results. They answered with a total of 1,831 distinct tags, most of them food-related, including some controversial ones like “shark fin soup” and “foie gras”.

Top 50 for Google Cloud Vision.

Top 50 Classes for Imagga

Imagga answered with a total of 832 distinct tags from their generic model. While most of them were food-related, I also got back “concrete”, “snake”, and “winter” at times.

Top 50 for Imagga.

Top 50 Classes for Microsoft Computer Vision

Microsoft Cognitive Services’ generic model came back with a total of 1,070 distinct tags, most of them food-related.

Top 50 for Microsoft Computer Vision.

Top 50 Classes for IBM Watson

Watson returned 873 distinct food-related tags from the food model.

Top 50 for IBM Watson.

spoonacular Confusion Matrix

For spoonacular, we were able to create an actual confusion matrix. The y-axis shows the tested classes and the x-axis the model’s prediction. The diagonal (top left to bottom right) shows correct classifications.

spoonacular API 50 classes confusion matrix.

The total accuracy of spoonacular’s model is 90%. Most problematic seems to be “baked apple” with only 71% accuracy, while “beer” and “burger” are recognized with 100% accuracy — cheers to that!
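If you want to build such a matrix for your own results, the bookkeeping is simple: count (actual, predicted) pairs and read the correct classifications off the diagonal. A minimal sketch with toy data; the real evaluation used 50 classes with 50 images each.

```java
import java.util.*;

public class ConfusionMatrix {
    public static void main(String[] args) {
        // Toy (actual, predicted) pairs standing in for the full evaluation output.
        String[][] pairs = {
                {"burger", "burger"}, {"beer", "beer"},
                {"baked apple", "apple pie"}, {"pancakes", "pancakes"}
        };

        // Nested map: actual class -> (predicted class -> count).
        Map<String, Map<String, Integer>> matrix = new TreeMap<>();
        int correct = 0;
        for (String[] pair : pairs) {
            String actual = pair[0], predicted = pair[1];
            matrix.computeIfAbsent(actual, k -> new TreeMap<>())
                  .merge(predicted, 1, Integer::sum);
            if (actual.equals(predicted)) {
                correct++;
            }
        }

        // One row per actual class; diagonal entries are the correct classifications.
        matrix.forEach((actual, row) -> System.out.println(actual + " -> " + row));
        System.out.printf("accuracy: %.0f%%%n", 100.0 * correct / pairs.length);
    }
}
```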

Resources and Tools Used

To run all the tests I used the Palladian Java Toolkit. Its wrappers for the cloud services Clarifai, Imagga, Amazon Rekognition, IBM Watson, Google Cloud Vision, and Microsoft Computer Vision made the evaluation much easier.

Thanks to Björn Hempel for writing his bachelor thesis on this topic, which you can read here.

If you’re interested in more detailed information you can download the raw data (Excel).

Summary

If you want to reliably tag food-related images, you may want to use a service that comes with a pre-trained food model, such as Clarifai, Watson, or spoonacular. If you have the time, knowledge, and resources, you can of course build your own dataset and train a custom model. Most online services allow you to train custom models, but the effort of dataset creation is definitely not to be underestimated.

Also, if you want to play around with the spoonacular dish classifier, I built this demo.
