Enhancing Features in Long-tailed Data Using Large Vision Model

Pengxiao Han, Changkun Ye, Jinguang Tong, Cuicui Jiang, Jie Hong, Li Fang, Xuesong Li

TL;DR

This work tackles long-tailed recognition without relying on language data by leveraging the Segment Anything Model (SAM) to augment visual features. It fuses SAM-derived map and latent features with a ResNet backbone and introduces a prototype-based latent-space loss with a memory-bank of class prototypes to balance head and tail learning. Empirical results on ImageNet-LT and iNaturalist2018 show consistent gains across many-shot, medium-shot, and few-shot categories, with notable improvements when combining SAM fusion and the prototype losses (e.g., All-class accuracy reaching 46.9% on ImageNet-LT with CE and up to 72.6% on iNaturalist2018 with Label Shift). Overall, the approach demonstrates that visual foundation-model features can substantially enhance LT recognition in the absence of linguistic inputs, offering a practical path for robust imbalanced vision systems. The method combines $\mathcal{L}_{\text{head}}$, $\mathcal{L}_{\text{tail-std}}$, and $\mathcal{L}_{\text{tail-dist}}$ into $\mathcal{L}_{\text{proto}}$, which is added to the baseline loss with weight $\beta$, yielding improved generalization across the long-tailed distribution.
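
Written out, the weighting described above reads as follows. This is a sketch rather than the paper's exact formulation: $\mathcal{L}_{\text{base}}$ is a label introduced here for the baseline loss (for example CE), and the unweighted sum inside $\mathcal{L}_{\text{proto}}$ is an assumption, since the summary only says the three terms are combined. The third term appears as $\mathcal{L}_{\text{tail-logdist}}$ in the Figure 4 caption.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{base}} + \beta\,\mathcal{L}_{\text{proto}},
\qquad
\mathcal{L}_{\text{proto}} = \mathcal{L}_{\text{head}} + \mathcal{L}_{\text{tail-std}} + \mathcal{L}_{\text{tail-dist}} \quad \text{(assumed form)}
$$

With $\beta = 0$ the objective reduces to the baseline loss alone, so $\beta$ controls how strongly the prototype terms shape the latent space.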

Abstract

Language-based foundation models, such as large language models (LLMs) and large vision-language models (LVLMs), have been widely studied in long-tailed recognition. However, linguistic data is not available for every practical task. In this study, we explore using large vision models (LVMs) or visual foundation models (VFMs) to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with the baseline network's features in both the map space and the latent space to obtain augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. We validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.

Paper Structure

This paper contains 21 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Feature extraction framework of the proposed method. The input image is first processed by the Segment Anything Model (SAM), represented as a frozen Vision Transformer (ViT), to obtain the feature output $\mathbf{F}_{SAM}$. Two operations are then applied to utilize the SAM features. In the first approach, Principal Component Analysis (PCA) reduces the feature channels to produce a single-channel feature map. In the second approach, average pooling generates the SAM feature vector corresponding to the input image.
  • Figure 2: Framework of the proposed method illustrating two feature augmentation methods: feature map fusion and latent feature vector fusion. The feature map extracted from SAM undergoes a $1 \times 1$ convolution and is fused with the backbone feature map via element-wise multiplication and addition. Subsequently, the latent feature vector extracted from SAM is fused with the feature embedding of the backbone after average pooling.
  • Figure 3: Framework for feature latent fusion into the backbone. The SAM and CNN feature vectors are concatenated and processed through a self-attention module, followed by a fully connected (FC) layer. The resulting features are then fused with the original CNN features through element-wise addition to enhance the final representation.
  • Figure 4: Illustration of the proposed prototype-based loss. The loss consists of three components: $\mathcal{L}_{\text{head}}$, $\mathcal{L}_{\text{tail-std}}$, and $\mathcal{L}_{\text{tail-logdist}}$. The head loss, $\mathcal{L}_{\text{head}}$ (black arrows), encourages compact clustering around the class prototype for head classes. For tail classes, $\mathcal{L}_{\text{tail-std}}$ (orange arrow) minimizes the distance between features and the prototype, while $\mathcal{L}_{\text{tail-logdist}}$ (blue arrow) enhances diversity by pushing features away from the prototype.
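
The two SAM feature operations in the Figure 1 caption and the map fusion in the Figure 2 caption can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, random arrays stand in for real SAM and backbone features, the two maps are assumed to share a spatial size, and the exact fusion formula (`backbone * projected + backbone`) is one plausible reading of "element-wise multiplication and addition".

```python
import numpy as np


def pca_single_channel(feat):
    """Project a (C, H, W) feature map onto its first principal component.

    Returns an (H, W) single-channel map.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T            # (H*W, C): one C-dim sample per location
    x = x - x.mean(axis=0, keepdims=True)   # center each channel
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return (x @ vt[0]).reshape(h, w)


def global_avg_pool(feat):
    """Average a (C, H, W) feature map over space to get a (C,) vector."""
    return feat.mean(axis=(1, 2))


def fuse_map(backbone_map, sam_map, weight, bias):
    """Fuse a single-channel SAM map into a (C, H, W) backbone map.

    A 1x1 convolution from 1 input channel to C output channels is a
    per-channel scale and shift, given here by `weight` and `bias` (both (C,)).
    ASSUMPTION: fusion is backbone * projected + backbone.
    """
    projected = weight[:, None, None] * sam_map[None] + bias[:, None, None]
    return backbone_map * projected + backbone_map


# Stand-in tensors; real SAM and ResNet features would replace these.
rng = np.random.default_rng(0)
f_sam = rng.normal(size=(16, 8, 8))        # hypothetical SAM feature map
f_backbone = rng.normal(size=(32, 8, 8))   # hypothetical backbone feature map

sam_map = pca_single_channel(f_sam)        # single-channel map, shape (8, 8)
sam_vec = global_avg_pool(f_sam)           # latent vector, shape (16,)
fused = fuse_map(f_backbone, sam_map, rng.normal(size=32), np.zeros(32))
```

The residual form keeps the backbone features intact when the projected SAM map is zero, which is one reason to prefer it in a sketch; the latent-vector fusion of Figure 3 (concatenation, self-attention, FC layer, element-wise addition) is not covered here.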