Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers

Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai

TL;DR

GLSim addresses fine-grained image recognition (FGIR) with Vision Transformers by replacing expensive attention-based region selection with Global-Local Similarity: the similarity between the CLS token and the patch tokens is used to identify discriminative crops. The cropped regions are re-encoded by the same encoder and fused with the original image's high-level features via an Aggregator, enabling robust predictions at a fraction of the computational cost. Experiments across 10 FGIR datasets show GLSim achieving state-of-the-art or competitive accuracy while substantially reducing memory and inference cost compared to attention-rollout methods. The similarity maps also serve as interpretable visualizations of discriminative regions and may be useful for downstream tasks beyond classification.

Abstract

Fine-grained recognition involves the classification of images from subordinate macro-categories, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection enabled by a feature extraction backbone followed by a high-level feature refinement step. Recently, many studies have shown the potential of vision transformers as a backbone for fine-grained recognition, but using the transformer's attention mechanism to select discriminative tokens can be computationally expensive. In this work, we propose a novel and computationally inexpensive metric to identify discriminative regions in an image. We compare the similarity between the global representation of an image, given by the CLS token (a learnable token used by transformers for classification), and the local representations of individual patches. We select the regions with the highest similarity to obtain crops, which are forwarded through the same transformer encoder. Finally, high-level features of the original and cropped representations are refined together in order to make more robust predictions. Through extensive experimental evaluation we demonstrate the effectiveness of our proposed method, obtaining favorable results in terms of accuracy across a variety of datasets. Furthermore, our method achieves these results at a much lower computational cost than the alternatives. Code and checkpoints are available at: https://github.com/arkel23/GLSim.
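The crop-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `gls_crop_box`, the use of cosine similarity as the similarity measure, and the choice to crop the bounding box that encloses the top-O patches are all assumptions made for this example.

```python
import numpy as np

def gls_crop_box(tokens, grid_h, grid_w, patch_size, top_o):
    """Sketch of global-local similarity crop selection (hypothetical helper).

    tokens: array of shape (1 + grid_h * grid_w, dim); row 0 is the CLS token,
            the remaining rows are patch tokens in row-major grid order.
    Returns the (grid_h, grid_w) similarity map and a pixel box (x0, y0, x1, y1).
    """
    cls, patches = tokens[0], tokens[1:]

    # Assumption: cosine similarity between the global (CLS) and local (patch) tokens.
    cls_n = cls / (np.linalg.norm(cls) + 1e-8)
    patches_n = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    sim = patches_n @ cls_n  # shape: (grid_h * grid_w,)

    # Indices of the top-O patches with the highest similarity.
    top = np.argsort(sim)[-top_o:]
    rows, cols = np.divmod(top, grid_w)

    # Assumption: crop the bounding box enclosing the selected patches.
    x0, y0 = cols.min() * patch_size, rows.min() * patch_size
    x1, y1 = (cols.max() + 1) * patch_size, (rows.max() + 1) * patch_size
    return sim.reshape(grid_h, grid_w), (int(x0), int(y0), int(x1), int(y1))
```

The selection costs one matrix-vector product over the final-layer tokens, so it needs no per-layer attention maps. The resulting box would then be used to crop and resize the input image before a second pass through the same encoder.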

Paper Structure

This paper contains 32 sections, 6 equations, 14 figures, and 15 tables.

Figures (14)

  • Figure 1: Overall flow of our proposed system, GLSim. Starting from the bottom left corner, an image is patchified and passed through a series of transformer encoder blocks to extract features. These are then used by the GLS Module to select discriminative crops, as indicated by the dashed lines. The GLS Module crops the image according to the coordinates corresponding to the top-$O$ tokens with the highest similarity between the global and local representations of the encoded image. The cropped image is then passed through the same encoder. Finally, high-level features from the original and cropped images are refined using an Aggregator module before making the final predictions.
  • Figure 2: Visualization of discriminative feature selection mechanisms for a fine-tuned ViT B-16 on the DAFB [branwen_danbooru2021_2015, rios_anime_2022] (first 2 rows) and iNat17 [van_horn_inaturalist_2018] (last 4 rows) datasets. From left to right: original image, head-wise average of last-layer attention scores, MAWS [wang_feature_2021] of the last layer, PSM [he_transfg_2022], attention rollout [abnar_quantifying_2020], and our proposed global-local similarity.
  • Figure 3: Visualization of samples from various fine-grained datasets. The first row shows the original images. The second and third rows show the heatmap for the proposed global-local similarity (GLS) metric and the crops obtained from it. The fourth row shows crops for CAL [rao_counterfactual_2021].
  • Figure 4: Accuracy and throughput plots for the evaluated models on CUB. Model$^*$ represents results for image size 448x448. We highlight the accuracy and throughput for our proposed GLSim method with image size 224x224.
  • Figure 5: Comparison of attention-based visualization mechanisms and our proposed GLS for DINOv2 B-14 [oquab_dinov2_2023] on NABirds.
  • ...and 9 more figures
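
For context on the attention rollout baseline that appears in the figure captions above, here is a minimal sketch of the commonly used rollout formulation. It averages attention over heads, adds the identity to account for residual connections, and multiplies the per-layer matrices together. The function name and the 0.5/0.5 residual weighting are assumptions of this example; it is not taken from the paper's code.

```python
import numpy as np

def attention_rollout(attn_per_layer):
    """Sketch of attention rollout (hypothetical helper).

    attn_per_layer: list of arrays of shape (heads, T, T), one per layer,
                    where each row is an attention distribution over T tokens
                    and token 0 is the CLS token.
    Returns rollout scores from the CLS token to the T - 1 patch tokens.
    """
    num_tokens = attn_per_layer[0].shape[-1]
    rollout = np.eye(num_tokens)
    for attn in attn_per_layer:
        a = attn.mean(axis=0)                   # average over heads
        a = 0.5 * a + 0.5 * np.eye(num_tokens)  # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows normalized
        rollout = a @ rollout                   # accumulate across layers
    return rollout[0, 1:]
```

This formulation needs the attention matrix of every layer plus one T x T matrix product per layer, whereas a global-local similarity score only compares final-layer tokens. That difference gives one intuition for the cost gap the summary reports.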