On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition

Zihu Wang, Lingqiao Liu, Scott Ricardo Figueroa Weston, Samuel Tian, Peng Li

TL;DR

The paper tackles fine-grained visual recognition (FGVR) in self-supervised learning by synthesizing data pairs from the latent feature space: non-discriminative feature dimensions are perturbed, and a decoder reconstructs images from both the original and the perturbed latent vectors. Grad-CAM and a low-variance criterion determine which dimensions to perturb. Training combines a contrastive objective $\mathcal{L}_C$, a reconstruction term $\mathcal{L}_R$, and a contrastive loss on the generated pairs $\mathcal{L}_{C_p}$ into the total loss $\mathcal{L} = \mathcal{L}_C + \alpha \mathcal{L}_R + \nu \mathcal{L}_{C_p}$. On multiple FGVR datasets, the method consistently improves over strong SSL baselines (e.g., MoCo v2, SimSiam) in linear evaluation and image retrieval. The results suggest that latent-space data synthesis with targeted perturbations can reduce task-irrelevant information and strengthen FGVR representations in a self-supervised setting.
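As a reading aid, the weighted sum in the total loss can be written as a small function. This is only a sketch: the function name and the numeric values below are hypothetical placeholders, and the three loss terms are assumed to have been computed elsewhere as scalars.

```python
def total_loss(l_c: float, l_r: float, l_cp: float, alpha: float, nu: float) -> float:
    """Weighted sum L = L_C + alpha * L_R + nu * L_Cp."""
    return l_c + alpha * l_r + nu * l_cp


# Hypothetical per-batch loss values and weights, chosen only for illustration.
loss = total_loss(l_c=2.0, l_r=0.5, l_cp=1.0, alpha=0.1, nu=0.5)
print(loss)
```

Setting `nu` to zero leaves $\mathcal{L}_C + \alpha \mathcal{L}_R$, the two-term variant that the Figure 5 caption refers to.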

Abstract

Self-Supervised Learning (SSL) has become a prominent approach for acquiring visual representations across various tasks, yet its application in fine-grained visual recognition (FGVR) is challenged by the intricate task of distinguishing subtle differences between categories. To overcome this, we introduce a novel strategy that boosts SSL's ability to extract the discriminative features vital for FGVR. This approach creates synthesized data pairs to guide the model to focus on these discriminative features during SSL. We start by identifying non-discriminative features using two main criteria: features with low variance that fail to effectively separate data and those deemed less important by Grad-CAM induced from the SSL loss. We then introduce perturbations to these non-discriminative features while preserving discriminative ones. A decoder is employed to reconstruct images from both perturbed and original feature vectors to create data pairs. An encoder is trained on such generated data pairs to become invariant to variations in non-discriminative dimensions while focusing on discriminative features, thereby improving the model's performance in FGVR tasks. We demonstrate the promising FGVR performance of the proposed approach through extensive evaluation on a wide variety of datasets.
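The selection-and-perturbation step described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Grad-CAM criterion is approximated by the magnitude of activation times gradient on a flat feature vector, the two criteria are combined by a union, and the perturbation is Gaussian noise. The abstract does not specify any of these details.

```python
import random
from statistics import pvariance


def low_variance_dims(features, k):
    """Indices of the k dimensions with the smallest variance across the dataset."""
    variances = [pvariance(col) for col in zip(*features)]
    return sorted(range(len(variances)), key=lambda d: variances[d])[:k]


def low_importance_dims(z, grad, k):
    """Indices of the k dimensions with the smallest |activation * gradient|.

    Assumption: a Grad-CAM-like score for a flat feature vector; `grad` stands
    in for the gradient of the SSL loss with respect to `z`.
    """
    scores = [abs(a * g) for a, g in zip(z, grad)]
    return sorted(range(len(z)), key=lambda d: scores[d])[:k]


def perturb(z, dims, sigma, rng):
    """Add Gaussian noise (an assumed perturbation) to the selected dimensions only."""
    out = list(z)
    for d in dims:
        out[d] += rng.gauss(0.0, sigma)
    return out


# Made-up 3-D features for four samples; not data from the paper.
features = [[0.0, 1.0, 5.0], [0.1, -1.0, 5.0], [0.0, 2.0, 5.1], [0.1, -2.0, 5.0]]
z = features[0]
grad = [0.5, 0.0, -1.0]  # hypothetical gradient of the SSL loss w.r.t. z

# Union of the two criteria (an assumption about how they are combined).
selected = sorted(set(low_variance_dims(features, 1)) | set(low_importance_dims(z, grad, 1)))
z_perturbed = perturb(z, selected, sigma=0.1, rng=random.Random(0))
print(selected, z_perturbed)
```

In the described pipeline, a decoder would then map both `z` and `z_perturbed` back to images to form a training pair; the decoder is omitted here.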

Paper Structure (34 sections, 8 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Overview of the proposed method. (a) Our method can be incorporated into various existing SSL methods. A decoder is utilized to generate images from both the original feature vector and its perturbed counterpart to form data pairs. The overall loss consists of three terms: a conventional contrastive loss, a reconstruction loss (ensuring the decoder evolves with the encoder), and a proposed contrastive loss on the generated pairs. (b) We propose two techniques to identify and perturb non-discriminative features in a feature vector, i.e., features with low variance that fail to effectively separate data and those deemed less important by Grad-CAM induced from the SSL loss.
  • Figure 2: An illustration of the data distribution in the feature space of encoders pre-trained by MoCo v2 [chen2020improved]. Blue and red dots represent feature vectors of data from two categories in each of 3 fine-grained datasets: CUB-200 [wah2011caltech], Stanford Cars [krause20133d], and FGVC-Aircraft [maji2013fine]. $v_{min}$ and $v_{max}$ are the dimensions of the feature space along which the data have the minimal and maximal variance across the dataset. A fitted probability density curve for each category along each dimension is attached to the corresponding axis. Different classes are separated much better along $v_{max}$ than along $v_{min}$.
  • Figure 3: Grad-CAM attention visualized on images. Our proposed method is incorporated into MoCo v2 and SimSiam and compared with them.
  • Figure 4: Generated data pairs on CUB-200, Stanford Cars, and FGVC-Aircraft. The original images are also included.
  • Figure 5: Performance of encoders trained with $\mathcal{L}_C+\alpha\cdot\mathcal{L}_R$ for different values of $\alpha$ (green solid line). Top-1 classification accuracy on Stanford Cars is reported. MoCo v2 (red dashed line) and our method (blue dashed line) are included for comparison.
  • ...and 1 more figure
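The quantities named in the Figure 2 caption, $v_{min}$ and $v_{max}$, can be computed as in the sketch below. The helper names, the separation score (gap between class means divided by the summed standard deviations), and the toy features are all assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
from statistics import mean, pstdev, pvariance


def extreme_variance_dims(features):
    """Return (v_min, v_max): indices of the lowest- and highest-variance dimensions."""
    variances = [pvariance(col) for col in zip(*features)]
    v_min = min(range(len(variances)), key=variances.__getitem__)
    v_max = max(range(len(variances)), key=variances.__getitem__)
    return v_min, v_max


def separation(class_a, class_b, dim):
    """Gap between class means along `dim`, relative to the classes' spread (hypothetical score)."""
    a = [f[dim] for f in class_a]
    b = [f[dim] for f in class_b]
    spread = pstdev(a) + pstdev(b)
    return abs(mean(a) - mean(b)) / spread if spread > 0 else float("inf")


# Made-up 2-D features for two classes; not data from the paper.
class_a = [[0.50, 1.0], [0.52, 1.2], [0.48, 0.8]]
class_b = [[0.51, 3.0], [0.49, 3.2], [0.50, 2.8]]

v_min, v_max = extreme_variance_dims(class_a + class_b)
print(v_min, v_max)
print(separation(class_a, class_b, v_min), separation(class_a, class_b, v_max))
```

On this toy input the two classes overlap along the low-variance dimension and are well apart along the high-variance one, which is the pattern the caption describes for real encoders.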