Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers
Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai
TL;DR
GLSim tackles fine-grained image recognition (FGIR) with Vision Transformers by replacing expensive attention-based region selection with a Global-Local Similarity metric, computed between the CLS token and the patch tokens, to identify discriminative crops. The cropped regions are re-encoded by the same encoder and fused with the original image's high-level features via an Aggregator, yielding more robust predictions. Experiments across 10 FGIR datasets show GLSim achieving state-of-the-art or competitive accuracy at substantially lower memory and inference cost than attention-rollout methods. The similarity maps also offer interpretable visualizations of discriminative regions, suggesting potential for downstream tasks beyond classification.
Abstract
Fine-grained recognition involves classifying images into subordinate categories of a macro-category, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection, enabled by a feature extraction backbone followed by a high-level feature refinement step. Recently, many studies have shown the potential of vision transformers as a backbone for fine-grained recognition, but using their attention mechanism to select discriminative tokens can be computationally expensive. In this work, we propose a novel and computationally inexpensive metric to identify discriminative regions in an image. We compare the similarity between the global representation of an image, given by the CLS token (a learnable token used by transformers for classification), and the local representations of individual patches. We select the regions with the highest similarity to obtain crops, which are forwarded through the same transformer encoder. Finally, high-level features of the original and cropped representations are refined together to make more robust predictions. Through extensive experimental evaluation, we demonstrate the effectiveness of the proposed method, obtaining favorable accuracy across a variety of datasets. Furthermore, our method achieves these results at a much lower computational cost than the alternatives. Code and checkpoints are available at: https://github.com/arkel23/GLSim.
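The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the use of cosine similarity, the bounding-box rule for turning selected patches into a crop, and all names (`cosine`, `select_crop`, `grid_size`, `patch_px`, `top_k`) are assumptions made for illustration. Real CLS and patch tokens would come from a transformer encoder; here they are tiny hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def select_crop(cls_token, patch_tokens, grid_size, patch_px, top_k):
    """Pick the top_k patches most similar to the CLS token.

    Returns the pixel bounding box (x0, y0, x1, y1) that encloses them
    (an assumed cropping rule) and the selected patch indices.
    Patches are assumed to be in row-major order on a square grid.
    """
    sims = [cosine(cls_token, p) for p in patch_tokens]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
    rows = [i // grid_size for i in top]
    cols = [i % grid_size for i in top]
    box = (
        min(cols) * patch_px,
        min(rows) * patch_px,
        (max(cols) + 1) * patch_px,
        (max(rows) + 1) * patch_px,
    )
    return box, top


# Toy example: a 3x3 grid of 16-pixel patches with 2-dimensional tokens.
cls_token = [1.0, 0.0]
patch_tokens = [[0.0, 1.0] for _ in range(9)]  # background: orthogonal to CLS
patch_tokens[4] = [1.0, 0.0]                   # center patch: identical direction
patch_tokens[5] = [0.9, 0.1]                   # right neighbor: nearly identical

box, top = select_crop(cls_token, patch_tokens, grid_size=3, patch_px=16, top_k=2)
print(box, top)  # (16, 16, 48, 32) [4, 5]
```

In the pipeline the abstract outlines, the image region inside `box` would then be resized and passed through the same encoder, and its features refined together with those of the original image. The appeal of this kind of selection is that it needs only one similarity per patch, rather than aggregating attention maps across layers.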
