The Devil is in the Tails: Fine-grained Classification in the Wild

Grant Van Horn, Pietro Perona

TL;DR

The paper addresses the challenge of long-tailed distributions in fine-grained classification by constructing realistic long-tail regimes from eBird data and evaluating a state-of-the-art CNN (Inception-v3) under uniform, approximate long-tail, and full long-tail conditions. It finds that abundant data yields excellent accuracy and that adding more classes minimally degrades performance, but scarce data causes steep drops, especially for tail classes. Crucially, transfer learning within a single domain provides negligible benefit to tail classes, indicating little cross-class knowledge transfer from head to tail. The work highlights the need for dedicated low-shot and transfer-learning approaches to address real-world long-tail visual recognition tasks and provides baselines for future comparisons.

Abstract

The world is long-tailed. What does this mean for computer vision and visual recognition? The main two implications are (1) the number of categories we need to consider in applications can be very large, and (2) the number of training examples for most categories can be very small. Current visual recognition algorithms have achieved excellent classification accuracy. However, they require many training examples to reach peak performance, which suggests that long-tailed distributions will not be dealt with well. We analyze this question in the context of eBird, a large fine-grained classification dataset, and a state-of-the-art deep network classification algorithm. We find that (a) peak classification performance on well-represented categories is excellent, (b) given enough data, classification performance suffers only minimally from an increase in the number of classes, (c) classification performance decays precipitously as the number of training examples decreases, (d) surprisingly, transfer learning is virtually absent in current methods. Our findings suggest that our community should come to grips with the question of long tails.

Paper Structure

This paper contains 13 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: (a) The world is long-tailed. Class frequency statistics in real world datasets (birds, a wide array of natural species, and trees). These are long-tailed distributions where a few classes have many examples and most classes have few. (b) The 4 experimental long tail datasets used in this work. We modeled the eBird dataset (blue curve in (a)) and created four long tail datasets by shifting the modeled eBird dataset down (fewer images) and to the left (fewer species) by different amounts. Classes are split into head and tail groups; images per class in the respective groups decay exponentially. (c) Approximation of a long tail dataset. This approximation allows us to more easily study the effects of head classes on tail class performance.
  • Figure 2: (a) Classification performance as a function of training set size on uniform datasets. A neural network (solid lines) achieves excellent accuracy on these uniform datasets. Performance keeps improving as the number of training examples increases to 10K per class -- each 10x increase in dataset size is rewarded with a 2x cut in the error rate. We also see that the neural net scales extremely well with increased number of classes, increasing error only marginally when 10x more classes are used. Neural net performance is also compared with SVM (dashed lines) trained on extracted ImageNet features. We see that fine-tuning the neural network is beneficial in all cases except in the extreme case of 10 classes with 10 images each. (b) Example misclassifications. Four of the twelve images misclassified by the 10 class, 10K images per class model. Clockwise from top left: Osprey misclassified as Great Blue Heron, Bald Eagle (center of image) misclassified as Great Blue Heron, Cooper's Hawk misclassified as Great Egret, and Ring-billed Gull misclassified as Great Egret.
  • Figure 3: Uniform vs. Natural Sampling -- effect on error. Error plots for models trained with uniform sampling and natural sampling. (a) The overall error of both methods is roughly equivalent, with natural sampling tending to be as good or better than uniform sampling. (b) Head classes clearly benefit from natural sampling. (c) Tail classes tend to have the same error under both sampling regimes.
  • Figure 4: Uniform vs. Natural Sampling -- effect on accuracy. We compare the effect of uniformly sampling from classes vs. sampling from their natural image distribution when creating training batches for long-tailed datasets (see the section on uniform vs. natural sampling). We use 30 test images per class, so correct classification rate is binned into 31 bins. It is clear that the head classes (marked as stars) benefit from the natural sampling in both datasets. The tail classes in (a) have an average accuracy of 32.1% and 34.2% for uniform and natural sampling respectively. The tail classes in (b) have an average accuracy of 33.5% and 38.6% for uniform and natural sampling respectively. For both plots, head classes have 1000 images and tail classes have 10 images.
  • Figure 5: Transfer between head and tail in approximate long tail datasets. (a) Head class accuracy is plotted against tail class accuracy as we vary the number of training examples in the head and in the tail for the approximate long tail datasets. Each point is associated with its nearest label. The labels indicate (in base 10) how much training data was in each head class (H) and each tail class (T). Lines between points indicate an increase in either images per head class, or images per tail class. As we increase images in the head class by factors of 10, the performance on the tail classes remains approximately constant. This means that there is a very poor transfer of knowledge from the head classes to the tail classes. As we increase the images per tail class, we see a slight loss in performance in the head classes. The overall accuracy of the model is vastly improved though. (b) Histogram of error rates for a long tail dataset. The same story applies here: the tail classes do not benefit from the head classes. The overall error of the joint head and tail model is 48.6%. See the histogram-of-error-rates figure for additional details.
  • ...and 3 more figures
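
Two quantitative statements in the captions above can be made concrete with a short sketch. The Figure 1 caption describes classes split into head and tail groups whose images per class decay exponentially, and the Figure 2 caption states that each 10x increase in training data is rewarded with a 2x cut in error. The Python below is only an illustration under stated assumptions: the function names, the group sizes, and the image counts are hypothetical and are not taken from the paper, and the actual construction used by the authors may differ.

```python
import math


def long_tail_counts(num_head, num_tail, head_max, head_min, tail_max, tail_min):
    """Images per class for a head/tail split (hypothetical helper).

    Within each group the count decays exponentially from the group's
    maximum to its minimum, mirroring the description in the Figure 1
    caption. All numeric arguments are illustrative assumptions.
    """

    def decay(n, hi, lo):
        if n == 1:
            return [hi]
        # Rate chosen so the first class has `hi` images and the last has `lo`.
        rate = math.log(lo / hi) / (n - 1)
        return [max(1, round(hi * math.exp(rate * i))) for i in range(n)]

    return decay(num_head, head_max, head_min) + decay(num_tail, tail_max, tail_min)


def scaled_error(base_error, base_images, images, cut_per_decade=2.0):
    """Error implied by the rule 'each 10x more data gives a 2x cut in error'.

    This is equivalent to a power law: error is proportional to
    images ** (-log10(cut_per_decade)), about images ** -0.301 for a 2x cut.
    It is an extrapolation of the caption's rule of thumb, not a fitted model.
    """
    decades = math.log10(images / base_images)
    return base_error / (cut_per_decade ** decades)


if __name__ == "__main__":
    # Hypothetical dataset: 10 head classes and 90 tail classes.
    counts = long_tail_counts(10, 90, head_max=1000, head_min=100,
                              tail_max=100, tail_min=10)
    print("classes:", len(counts), "total images:", sum(counts))
    print("head:", counts[:10])

    # Hypothetical starting point: 20% error at 10 images per class.
    for n in (10, 100, 1000, 10000):
        print(n, "images/class ->", round(scaled_error(0.20, 10, n), 4))
```

The exponential rate is fixed by the two endpoints of each group, so changing only the group size stretches or compresses the same curve. The scaling function applies only where the caption's rule was observed (uniform datasets with up to 10K images per class); using it outside that range is an assumption.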