Extract More from Less: Efficient Fine-Grained Visual Recognition in Low-Data Regimes

Dmitry Demidov, Abduragim Shtanchaev, Mihail Mihaylov, Mohammad Almansoori

TL;DR

The paper tackles fine-grained image classification (FGIC) under extreme data scarcity, where inter-class variance is low, intra-class variation is high, and labelled samples are few. It introduces AD-Net, an architecture-agnostic framework that combines augmentation with self-distillation across multiple augmented views of each image, optimized via $\mathcal{L}_{agg} = \mathcal{L}_{main} + \alpha \mathcal{L}_{dist}$, where $\mathcal{L}_{main}$ is the cross-entropy loss and $\mathcal{L}_{dist}$ is a KL-divergence distillation loss. Key contributions include a distillation mechanism that compares different crops passed through a single shared-weight model, extensive ablations validating the design choices, and demonstrated transferability across CNN and ViT backbones, all at zero extra inference cost. On FGIC benchmarks, the method achieves up to 45% relative accuracy improvement over a standard ResNet-50 at 10% data and up to 27% over the closest state-of-the-art low-data runner-up.
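The aggregated objective can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the KL direction (target to source), the use of softmax-normalised outputs for distillation, and the default `alpha=1.0` are assumptions made only for illustration, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """L_main: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])

def kl_divergence(target_probs, source_probs, eps=1e-12):
    """L_dist: KL(target || source); the direction is an assumption."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(target_probs, source_probs)
        if t > 0
    )

def aggregated_loss(main_logits, label, target_logits, source_logits, alpha=1.0):
    """L_agg = L_main + alpha * L_dist (alpha=1.0 is a placeholder value)."""
    l_main = cross_entropy(main_logits, label)
    l_dist = kl_divergence(softmax(target_logits), softmax(source_logits))
    return l_main + alpha * l_dist, l_main, l_dist

# Toy outputs for one image: largest crop (classification) and two smaller crops.
total, l_main, l_dist = aggregated_loss(
    main_logits=[2.0, 0.5, -1.0],
    label=0,
    target_logits=[1.8, 0.4, -0.9],
    source_logits=[1.0, 0.9, -0.2],
    alpha=0.5,
)
print(f"L_main={l_main:.4f}  L_dist={l_dist:.4f}  L_agg={total:.4f}")
```

Because both extra views pass through the same shared-weight model and only the classification output is needed at test time, a loss of this shape adds cost during training only.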

Abstract

The emerging task of fine-grained image classification in low-data regimes assumes the presence of low inter-class variance and large intra-class variation along with a highly limited amount of training samples per class. However, traditional ways of separately dealing with fine-grained categorisation and extremely scarce data may be inefficient under both these harsh conditions presented together. In this paper, we present a novel framework, called AD-Net, aiming to enhance deep neural network performance on this challenge by leveraging the power of Augmentation and Distillation techniques. Specifically, our approach is designed to refine learned features through self-distillation on augmented samples, mitigating harmful overfitting. We conduct comprehensive experiments on popular fine-grained image classification benchmarks where our AD-Net demonstrates consistent improvement over traditional fine-tuning and state-of-the-art low-data techniques. Remarkably, with the smallest data available, our framework shows an outstanding relative accuracy increase of up to 45% compared to standard ResNet-50 and up to 27% compared to the closest SOTA runner-up. We emphasise that our approach is practically architecture-independent and adds zero extra cost at inference time. Additionally, we provide an extensive study on the impact of each component of the framework, highlighting the importance of each in achieving optimal performance. Source code and trained models are publicly available at github.com/demidovd98/fgic_lowd.
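Since the headline numbers are relative rather than absolute gains, a short sketch of the arithmetic may help. The accuracies below are made-up values chosen only to show how a 45% relative increase is computed; they are not results from the paper, and `relative_gain` is a hypothetical helper.

```python
def relative_gain(baseline_acc, new_acc):
    """Relative improvement of new_acc over baseline_acc, as a fraction."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc

# Hypothetical accuracies (in %), used only to illustrate the arithmetic.
baseline, improved = 40.0, 58.0
absolute_points = improved - baseline
relative = relative_gain(baseline, improved)
print(f"absolute: +{absolute_points:.1f} points, relative: +{relative:.0%}")
```

The same relative figure corresponds to fewer absolute points when the baseline is lower, which is typical of the smallest-data settings.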

Paper Structure (26 sections, 4 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: An overview of our proposed pre-processing pipeline. Three random crops of different sizes are generated from each input image and all of them are further augmented with the same set of transformations. The largest cropped region is further used for classification and the other two (mid- and small-size) are used as target and source in a distillation objective. Only random crop is shown as an applied augmentation for simplicity.
  • Figure 2: A proposed architecture. A random crop is applied as an input augmentation for images which are subsequently fed into the same model with shared weights. Further, we compare the feature outputs of different crops with a distillation loss $\mathcal{L}_{dist}$. The aggregated loss $\mathcal{L}_{agg}$ is a combination of the traditional cross-entropy loss $\mathcal{L}_{main}$ and $\mathcal{L}_{dist}$.
  • Figure 3: Model prediction probability distribution over 1000 forward passes with Monte Carlo Dropout. X-axis stands for the amount of Gaussian noise added (see bottom row) and Y-axis is the probability of a predicted class. Each dash represents a single model's prediction, where green is the correct class and others are the Top-2 following classes. Each diamond stands for the mean probability over all predictions.
  • Figure 4: Left: Comparison of test accuracy evolution for our approach and traditionally fine-tuned vanilla ResNet-50 (CUB 10% dataset). Right: Training loss evolution for both classification and distillation objectives in our AD-Net (ResNet-50, CUB 10% dataset).
  • Figure 5: The visualisation of difference in feature activation maps between the vanilla ResNet-50 and our AD-Net on the CUB dataset. Red colour - higher activation, blue - lower activation.
  • ...and 1 more figures
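
The three-crop pre-processing described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the crop scales `(0.9, 0.7, 0.5)`, the independent placement of each crop, and the reading of mid-size as distillation target and small-size as source are all assumptions, and the function names are hypothetical. Only crop boxes are computed; the shared set of further augmentations is omitted.

```python
import random

def random_crop_box(width, height, scale, rng):
    """Return (left, top, right, bottom) for a random crop covering
    `scale` of each image side. Assumes 0 < scale <= 1."""
    crop_w = max(1, int(width * scale))
    crop_h = max(1, int(height * scale))
    left = rng.randint(0, width - crop_w)   # randint is inclusive on both ends
    top = rng.randint(0, height - crop_h)
    return (left, top, left + crop_w, top + crop_h)

def three_crop_boxes(width, height, scales=(0.9, 0.7, 0.5), seed=None):
    """Sample three crops of different sizes from one image.

    The largest crop feeds the classification loss; the mid- and
    small-size crops feed the distillation loss (role assignment assumed).
    """
    rng = random.Random(seed)
    large, mid, small = (
        random_crop_box(width, height, s, rng)
        for s in sorted(scales, reverse=True)
    )
    return {
        "classification": large,
        "distill_target": mid,
        "distill_source": small,
    }

boxes = three_crop_boxes(448, 448, seed=0)
for role, box in boxes.items():
    print(role, box)
```

Each box could then be passed to an image library's crop call, with the same augmentations applied to all three views before they enter the shared-weight model.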