Fine-grained Image Retrieval via Dual-Vision Adaptation

Xin Jiang, Meiqi Cao, Hao Tang, Fei Shen, Zechao Li

TL;DR

This work tackles fine-grained image retrieval (FGIR) without full fine-tuning by freezing a large pre-trained backbone and introducing three components: Object-Perceptual Adaptation, which emphasizes discriminative object regions and contextual background; In-Context Adaptation (ICA), which lightly modifies features via a compact bottleneck; and Discriminative Perceptual Transfer (DPT), which distills discriminative cues into the encoder. The resulting Dual-Vision Adaptation (DVA) approach yields state-of-the-art or competitive results on several FGIR benchmarks while training only a tiny fraction of the parameters (0.68%), highlighting improved generalization and retrieval efficiency. Extensive experiments validate the effectiveness of ICA and DPT, and visualizations show more focused attention on discriminative regions. Overall, DVA provides a practical, parameter-efficient pathway for FGIR that leverages pre-training knowledge while adapting to fine-grained distinctions.
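The "compact bottleneck" idea can be illustrated with a generic residual adapter: project a frozen feature down to a small hidden width, apply a nonlinearity, project back up, and add the result to the original feature. The sketch below is a minimal pure-Python illustration under stated assumptions, not the paper's ICA module. The class name `BottleneckAdapter`, the feature width 768, the hidden width 16, the ReLU, and the zero-initialized up-projection are all hypothetical choices made for this example.

```python
import random


def make_matrix(rows, cols, scale, rng):
    """Return a rows x cols matrix with small uniform random entries."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class BottleneckAdapter:
    """Generic residual bottleneck: x + up(relu(down(x))).

    Hypothetical illustration only; the paper's exact design is not
    specified in this summary.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> hidden (the only randomly initialized weights).
        self.down = make_matrix(hidden, dim, 0.02, rng)
        # Up-projection: hidden -> dim, zero-initialized so the adapter
        # starts as an identity mapping on the frozen feature.
        self.up = [[0.0] * hidden for _ in range(dim)]

    def num_params(self):
        """Count trainable weights (no biases in this sketch)."""
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]  # ReLU in the bottleneck
        delta = matvec(self.up, h)
        return [xi + di for xi, di in zip(x, delta)]     # residual connection


adapter = BottleneckAdapter(dim=768, hidden=16)
frozen_feature = [0.1] * 768          # stand-in for a frozen backbone output
adapted = adapter(frozen_feature)
print(adapter.num_params())           # 2 * 768 * 16 weights
```

Because the up-projection starts at zero, the adapted feature equals the frozen feature before any training, which is one common way to avoid disturbing pre-trained representations at initialization. The parameter count grows with `2 * dim * hidden`, so a small `hidden` keeps the trainable share low.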

Abstract

Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow one of two regimes: enforcing pairwise similarity constraints in the semantic embedding space, or incorporating a localization sub-network to fine-tune the entire model. However, both regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This brings the FGIR task, when performed on these adjusted features, closer to the task solved during pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.
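The abstract refers to a knowledge distillation mechanism without giving its loss here. As a rough illustration, the sketch below implements the classic temperature-scaled distillation loss: the KL divergence between softened teacher and student distributions, scaled by the squared temperature. This is an assumption about the general technique, not the paper's actual objective. The function names, the temperature value, and the example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic distillation loss used only as an illustration; the summary
    does not state which loss the paper uses.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Hypothetical logits: the "teacher" sees the object-aware input,
# the "student" is the plain image encoder.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss_mismatch = distillation_loss(student, teacher)
loss_match = distillation_loss(teacher, teacher)
print(loss_mismatch > loss_match)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pulls the encoder toward the teacher's discriminative behavior. The teacher branch can then be dropped at inference time, which matches the efficiency motivation stated above.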

Paper Structure

This paper contains 22 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: (a) Previous fine-tuning methods. (b) Our dual-vision adaptation method. Our approach combines collaborative sample and feature adaptation to exploit category-specific differences. This dual strategy enables the model to retain the broad representation capabilities gained from pre-training data while adapting to fine-grained data.
  • Figure 2: DVA consists of three essential modules: the Object-Perceptual Adaptation module to enhance the encoder’s ability to focus on discriminative object regions, the In-Context Adaptation module to dynamically refine fine-grained features while suppressing irrelevant background retained in frozen representations, and the Discriminative Perceptual Transfer module to distill discriminative awareness into the encoder, enabling auxiliary-free inference while preserving pre-trained knowledge.
  • Figure 3: Analyses of hyper-parameter $\beta$ on CUB-200-2011.
  • Figure 4: Class activation visualizations on CUB-200-2011. For each sample, we show the input image, the class activation map of the baseline model, and the proposed DVA.