PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition

ShuaiHeng Li, Qing Cai, Fan Zhang, Menghuan Zhang, Yangyang Shu, Zhi Liu, Huafeng Li, Lingqiao Liu

TL;DR

PP-SSL tackles self-supervised FGVR's granularity gap by introducing AIS and IADM. AIS uses a fine-grained text corpus and CLIP-based knowledge distillation to filter irrelevant features, while IADM uses GradCAM from the original image to highlight subtle cues. The total loss combines contrastive learning with AIS and IADM, and inference uses only the image encoder for efficiency. Experiments on seven FGVR datasets show consistent retrieval and classification gains over state-of-the-art SSL methods, underscoring the practical value of the approach.

Abstract

Self-supervised learning is emerging in fine-grained visual recognition (FGVR) with promising results. However, existing self-supervised learning methods are often susceptible to irrelevant patterns in self-supervised tasks and lack the capability to represent the subtle differences inherent in FGVR, resulting in generally poorer performance. To address this, we propose a novel Priority-Perception Self-Supervised Learning framework, denoted as PP-SSL, which can effectively filter out irrelevant feature interference and extract more subtle discriminative features throughout the training process. Specifically, it comprises two main parts: the Anti-Interference Strategy (AIS) and the Image-Aided Distinction Module (IADM). In AIS, a fine-grained textual description corpus is established, and a knowledge distillation strategy is devised to guide the model in eliminating irrelevant features while enhancing the learning of more discriminative and high-quality features. IADM builds on the observation that GradCAM extracted from the original image effectively reveals subtle differences between fine-grained categories. Compared to features extracted from intermediate or output layers, the original image retains more detail, allowing for a deeper exploration of the subtle distinctions among fine-grained classes. Extensive experimental results indicate that PP-SSL significantly outperforms existing methods across various datasets, highlighting its effectiveness in fine-grained recognition tasks. Our code will be made publicly available upon publication.
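
To make the knowledge distillation idea in AIS more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a frozen teacher (for example a CLIP image encoder) and the student encoder each produce an image feature, that both are compared against the same N text-description features by cosine similarity, and that the student is pulled towards the teacher's similarity distribution with a KL divergence. The function names, the temperature value, and the toy 2-D features are all hypothetical; the summary above does not specify the actual loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def distillation_loss(student_img, teacher_img, text_feats, temperature=0.1):
    """KL(teacher || student) over image-to-text similarity distributions.

    Assumption: this form of the loss is illustrative only.
    """
    t_probs = softmax([cosine(teacher_img, t) for t in text_feats], temperature)
    s_probs = softmax([cosine(student_img, t) for t in text_feats], temperature)
    return sum(p * math.log(p / q) for p, q in zip(t_probs, s_probs))

# Toy example: three text descriptions, the first one is the relevant one.
text_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
teacher = [0.9, 0.1]          # teacher image feature, close to the relevant text
student_close = [0.8, 0.2]    # student that agrees with the teacher
student_far = [0.1, 0.9]      # student distracted by an unrelated description

loss_close = distillation_loss(student_close, teacher, text_feats)
loss_far = distillation_loss(student_far, teacher, text_feats)
loss_same = distillation_loss(teacher, teacher, text_feats)
```

In this toy setting the loss is zero when the student matches the teacher and grows as the student's feature drifts towards an unrelated description, which is the behaviour a distillation term of this kind is meant to penalize.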

Paper Structure

This paper contains 14 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) Overview of our self-supervised framework: By incorporating AIS and IADM during the self-supervised training process, we effectively address the issue of irrelevant feature interference and extract the most detailed discriminative cues from the original images, thereby improving the performance of self-supervised learning in fine-grained recognition tasks. (b) During the inference phase, we remove redundant components, requiring only the output from the image encoder to be applied to downstream tasks, offering enhanced flexibility and convenience.
  • Figure 2: Our AIS utilizes the CLIP image encoder to guide the encoder in generating high-quality features with semantic category understanding. In the diagram, input images are all of the "Cars" category, with one relevant attribute description and other unrelated descriptions in the text. This setup constrains the model to produce high-quality, semantically aware representations.
  • Figure 3: Attention map visualizations on the CUB-200-2011, Stanford Cars, and FGVC Aircraft datasets comparing our method with others. Our method effectively reduces interference from irrelevant features and identifies key parts of the target object.
  • Figure 4: The effectiveness of the proposed IADM is shown via GradCAM visualization, highlighting finer discriminative features identified in the image.
  • Figure 5: Analysis of the text number $N$ in terms of Rank-1 metric (in %) on the CUB-200-2011 Dataset.
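
Figure 5 reports the Rank-1 retrieval metric. As a rough illustration of how such a metric is typically computed, the sketch below counts a query as a hit when its nearest neighbour by cosine similarity (excluding the query itself) shares its label. The leave-one-out protocol, the function names, and the toy features are assumptions made for this example; the evaluation protocol used in the paper is not described in this summary.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank1_accuracy(features, labels):
    """Percentage of samples whose nearest neighbour has the same label."""
    feats = [l2_normalize(f) for f in features]
    hits = 0
    for i, query in enumerate(feats):
        best_j, best_sim = None, -math.inf
        for j, candidate in enumerate(feats):
            if j == i:
                continue  # never match a query with itself
            sim = sum(a * b for a, b in zip(query, candidate))
            if sim > best_sim:
                best_j, best_sim = j, sim
        hits += labels[best_j] == labels[i]
    return 100.0 * hits / len(feats)

# Toy 2-D features: the last sample lies closer to class 0 than to its own class 1.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.6]]
labels = [0, 0, 1, 1, 1]

score = rank1_accuracy(features, labels)
print(f"Rank-1: {score:.1f}%")  # prints "Rank-1: 80.0%"
```

Four of the five toy samples retrieve a same-class neighbour first, so the score is 80%. In practice the features would come from the trained image encoder alone, in line with the inference setup described for Figure 1(b).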