Rethinking Vision Transformer for Large-Scale Fine-Grained Image Retrieval

Xin Jiang, Hao Tang, Yonghua Pan, Zechao Li

TL;DR

The paper tackles the scalability challenge of fine-grained image retrieval (FGIR) with Vision Transformers by introducing EET, a framework that combines Content-based Token Pruning (CTP) with a discriminative transfer strategy, consisting of Discriminative Knowledge Transfer (DKT) and Discriminative Region Guidance (DRG), and hash-code learning. CTP hierarchically prunes non-discriminative tokens based on intermediate token content to gain efficiency, while DKT and DRG preserve and augment fine-grained discriminative power during training without increasing inference cost. Hash learning is performed in two steps via proxy-based optimization, enabling efficient out-of-sample retrieval. Experiments across six FGIR benchmarks show competitive accuracy with substantial latency reductions, demonstrating the practical viability of efficient ViT-based FGIR for large-scale deployments.
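To make the retrieval side of this summary concrete, the following is a minimal sketch of how binary hash codes are typically used at query time: features are binarized, and database items are ranked by Hamming distance to the query. It is not the paper's hash-learning procedure (the two-step, proxy-based optimization is not reproduced here); the sign-threshold binarization, the function names `to_hash_code`, `hamming_distance`, and `retrieve`, and the toy 4-bit vectors are all assumptions made for illustration.

```python
def to_hash_code(features):
    """Binarize a real-valued feature vector into an integer bit code.

    Assumption: a simple sign threshold (value > 0 -> bit 1), used here
    only as a stand-in for a learned hashing layer.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(code_a, code_b):
    """Count the bit positions where two codes differ."""
    return bin(code_a ^ code_b).count("1")


def retrieve(query_code, database_codes, top_k=3):
    """Return indices of the top_k database codes closest to the query."""
    ranked = sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )
    return ranked[:top_k]


# Toy 4-bit example with hypothetical feature vectors.
database = [
    to_hash_code([0.9, -0.2, 0.4, -0.7]),
    to_hash_code([-0.5, 0.3, 0.1, -0.9]),
    to_hash_code([0.8, -0.1, -0.6, -0.3]),
]
query = to_hash_code([0.7, -0.4, 0.5, -0.2])
ranking = retrieve(query, database, top_k=3)
```

Because comparison reduces to an XOR and a bit count, short codes such as the 16-bit codes mentioned in the abstract keep both storage and per-query cost small, which is the usual motivation for hashing in large-scale retrieval.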

Abstract

Large-scale fine-grained image retrieval (FGIR) aims to retrieve images belonging to the same subcategory as a given query by capturing subtle differences in a large-scale setting. Recently, Vision Transformers (ViT) have been employed in FGIR due to their powerful self-attention mechanism for modeling long-range dependencies. However, most Transformer-based methods focus primarily on leveraging self-attention to distinguish fine-grained details, while overlooking the high computational complexity and redundant dependencies inherent to these models, limiting their scalability and effectiveness in large-scale FGIR. In this paper, we propose an Efficient and Effective ViT-based framework, termed EET, which integrates a token pruning module with a discriminative transfer strategy to address these limitations. Specifically, we introduce a content-based token pruning scheme to enhance the efficiency of the vanilla ViT, progressively removing background or low-discriminative tokens at different stages by exploiting feature responses and the self-attention mechanism. To ensure the resulting efficient ViT retains strong discriminative power, we further present a discriminative transfer strategy comprising both discriminative knowledge transfer and discriminative region guidance. Using a distillation paradigm, these components transfer knowledge from a larger "teacher" ViT to a more efficient "student" model, guiding the latter to focus on subtle yet crucial regions in a cost-free manner. Extensive experiments on two widely used fine-grained datasets and four large-scale fine-grained datasets demonstrate the effectiveness of our method. Specifically, EET reduces the inference latency of ViT-Small by 42.7% and boosts the retrieval performance of 16-bit hash codes by 5.15% on the challenging NABirds dataset.
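The progressive pruning idea described in the abstract can be sketched as follows. This is an illustrative approximation, not the paper's CTP module: the abstract says tokens are scored using feature responses and self-attention, but the exact criterion is not given here, so the sketch assumes a hypothetical score (the squared L2 norm of each patch token) and hypothetical keep ratios. The names `feature_response`, `prune_tokens`, and `progressive_prune` are likewise invented for this example.

```python
def feature_response(tokens):
    """Hypothetical importance score: squared L2 norm of each patch token.

    tokens[0] is treated as the class token and is not scored.
    """
    return [sum(x * x for x in token) for token in tokens[1:]]


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the class token plus the highest-scoring patch tokens.

    Returns the pruned token list and the kept patch indices
    (relative to the patch tokens, in their original order).
    """
    num_patches = len(tokens) - 1
    num_keep = max(1, int(num_patches * keep_ratio))
    ranked = sorted(range(num_patches), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_keep])  # restore original token order
    return [tokens[0]] + [tokens[i + 1] for i in kept], kept


def progressive_prune(tokens, keep_ratios):
    """Apply pruning once per stage, rescoring the surviving tokens each time."""
    for ratio in keep_ratios:
        tokens, _ = prune_tokens(tokens, feature_response(tokens), ratio)
    return tokens


# Toy input: one class token followed by eight 2-D patch tokens.
tokens = [
    [0.0, 0.0],  # class token
    [0.1, 0.1], [2.0, 1.0], [0.2, 0.0], [1.5, 1.5],
    [0.0, 0.3], [1.0, 0.1], [0.1, 0.2], [3.0, 0.0],
]
stage_one, kept = prune_tokens(tokens, feature_response(tokens), 0.5)
final = progressive_prune(tokens, keep_ratios=(0.5, 0.5))
```

In a real ViT the pruning stages would sit between transformer blocks, so every later block processes fewer tokens. Since self-attention cost grows quadratically with the token count, halving the tokens at a stage reduces the attention cost of subsequent blocks to roughly a quarter.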

Paper Structure

This paper contains 36 sections, 20 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: (a) Coarse-grained images: Significant visual differences between images from different categories. (b) Fine-grained images: Large intra-class variations within each row, and small inter-class variations within each column. This unique characteristic poses challenges when transitioning from coarse-grained to fine-grained hashing retrieval.
  • Figure 2: Overview of the proposed framework, which comprises three core components: (1) Content-based Token Pruning (CTP), (2) Discriminative Knowledge Transfer (DKT), and (3) Discriminative Region Guidance (DRG). CTP progressively discards background and low-discriminative tokens to significantly improve the computational efficiency of the Vision Transformer (ViT). The discriminative transfer strategy, consisting of DKT and DRG, enables the efficient ViT to learn highly discriminative hash code representations in a cost-free way. During inference, only the efficient ViT (i.e., with pruned tokens) is employed for hash code generation, thereby maintaining high efficiency.
  • Figure 3: Precision-Recall curves of EET and state-of-the-art methods on the three datasets.
  • Figure 4: The hyper-parameter analysis of $\beta$ on CUB-200-2011 with code bits from 16 to 64.
  • Figure 5: The hyper-parameter analysis of $\sigma$ on CUB-200-2011 with code bits from 16 to 64.
  • ...and 3 more figures