Rethinking Vision Transformer for Large-Scale Fine-Grained Image Retrieval
Xin Jiang, Hao Tang, Yonghua Pan, Zechao Li
TL;DR
The paper tackles the scalability challenge of fine-grained image retrieval (FGIR) with Vision Transformers (ViT) by introducing EET, a framework that combines Content-based Token Pruning (CTP), a discriminative transfer strategy consisting of Discriminative Knowledge Transfer (DKT) and Discriminative Region Guidance (DRG), and hash-code learning. CTP hierarchically prunes non-discriminative tokens based on intermediate token content to gain efficiency, while DKT and DRG preserve and augment fine-grained discriminative power during training without increasing inference cost. Hash learning is performed in two steps via proxy-based optimization, enabling efficient out-of-sample retrieval. Experiments across six FGIR benchmarks show competitive accuracy with substantial latency reductions, demonstrating the practical viability of efficient ViT-based FGIR for large-scale deployments.
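To make the retrieval side concrete, the following is a minimal sketch of how binary hash codes are typically used to rank a database by Hamming distance. It is not the paper's two-step proxy-based hash learning: the sign binarization, the 16-bit code length, the random features, and the helper names `to_hash` and `hamming_rank` are all assumptions chosen for illustration.

```python
import numpy as np


def to_hash(features):
    """Binarize real-valued features into {0, 1} codes by sign (an assumed, common choice)."""
    return (features > 0).astype(np.uint8)


def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code.

    query_code: (B,) binary code of the query.
    db_codes:   (M, B) binary codes of the database images.
    Returns the database indices sorted from nearest to farthest, plus all distances.
    """
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    order = np.argsort(dists, kind="stable")
    return order, dists


rng = np.random.default_rng(0)
db_feats = rng.normal(size=(1000, 16))   # stand-in for learned 16-dim hash-layer outputs
db_codes = to_hash(db_feats)

# An out-of-sample query would be encoded by the same network; here we reuse a database item.
query_code = to_hash(db_feats[42])
order, dists = hamming_rank(query_code, db_codes)
top10 = order[:10]
```

Because the codes are short binary vectors, ranking reduces to counting differing bits, which is what makes hash-based retrieval attractive at large scale.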
Abstract
Large-scale fine-grained image retrieval (FGIR) aims to retrieve images belonging to the same subcategory as a given query by capturing subtle differences in a large-scale setting. Recently, Vision Transformers (ViT) have been employed in FGIR due to their powerful self-attention mechanism for modeling long-range dependencies. However, most Transformer-based methods focus primarily on leveraging self-attention to distinguish fine-grained details while overlooking the high computational complexity and redundant dependencies inherent to these models, which limits their scalability and effectiveness in large-scale FGIR. In this paper, we propose an Efficient and Effective ViT-based framework, termed EET, which integrates a token pruning module with a discriminative transfer strategy to address these limitations. Specifically, we introduce a content-based token pruning scheme to enhance the efficiency of the vanilla ViT, progressively removing background or low-discriminative tokens at different stages by exploiting feature responses and the self-attention mechanism. To ensure that the resulting efficient ViT retains strong discriminative power, we further present a discriminative transfer strategy comprising both discriminative knowledge transfer and discriminative region guidance. Using a distillation paradigm, these components transfer knowledge from a larger "teacher" ViT to a more efficient "student" model, guiding the latter to focus on subtle yet crucial regions without adding inference cost. Extensive experiments on two widely used fine-grained datasets and four large-scale fine-grained datasets demonstrate the effectiveness of our method. Specifically, EET reduces the inference latency of ViT-Small by 42.7% and boosts the retrieval performance of 16-bit hash codes by 5.15% on the challenging NABirds dataset.
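The idea of progressively pruning tokens by feature response and self-attention can be illustrated with a small sketch. The abstract does not give the exact scoring rule, so everything specific here is an assumption: the score (class-token attention multiplied by the L2 norm of each patch feature), the `keep_ratio` of 0.7, the three-stage schedule, and the function name `prune_tokens` are hypothetical and only meant to show the general shape of such a scheme.

```python
import numpy as np


def prune_tokens(tokens, cls_attn, keep_ratio=0.7):
    """Keep the class token plus the highest-scoring patch tokens.

    tokens:     (N + 1, D) array; row 0 is the class token, rows 1..N are patches.
    cls_attn:   (N,) attention weights from the class token to each patch.
    keep_ratio: fraction of patch tokens to retain (hypothetical hyperparameter).
    """
    patches = tokens[1:]
    # Assumed content score: attention weight times feature magnitude.
    response = np.linalg.norm(patches, axis=1)
    score = cls_attn * response
    n_keep = max(1, int(np.ceil(keep_ratio * len(patches))))
    # Take the n_keep best patches, then restore their original spatial order.
    keep = np.sort(np.argsort(-score, kind="stable")[:n_keep])
    pruned = np.concatenate([tokens[:1], patches[keep]], axis=0)
    return pruned, keep


rng = np.random.default_rng(0)
tokens = rng.normal(size=(197, 384))   # 196 patches + class token at ViT-Small width
cls_attn = rng.random(196)
cls_attn /= cls_attn.sum()

# Prune at three successive stages to mimic a hierarchical schedule.
for stage in range(3):
    tokens, kept = prune_tokens(tokens, cls_attn, keep_ratio=0.7)
    # In a real model the attention would be recomputed by the next block;
    # here the surviving weights are simply renormalized.
    cls_attn = cls_attn[kept]
    cls_attn /= cls_attn.sum()
    print(stage, tokens.shape)
```

Since self-attention cost grows quadratically with the number of tokens, dropping a fraction of patches at each stage compounds into a sizeable reduction in later layers, which is the source of the efficiency gain such schemes target.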
