
Multi-scale Activation, Refinement, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition

Zhicheng Zhang, Hao Tang, Jinhui Tang

TL;DR

This work addresses fine-grained bird recognition under significant scale variation and background clutter by introducing MDCM, a framework that leverages an MS-ViT backbone through an Activation-Selection-Aggregation paradigm. It deploys Multi-Scale Cue Activation to diversify stage-specific cues, Multi-Scale Token Selection to prune noise while preserving critical scale-specific features, and Multi-Scale Dynamic Aggregation to adaptively fuse predictions across scales. The approach yields consistent improvements over CNN- and ViT-based baselines across CUB-200-2011, NABirds, and iNat2017, demonstrating both accuracy gains and efficient multi-scale representations. Overall, MDCM enhances FGBR by learning diverse, discriminative cues at multiple scales and aggregating them through a learned gating mechanism, with potential impact on ecological monitoring and large-scale species recognition.
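
As a rough illustration of what "aggregating through a learned gating mechanism" can mean, the sketch below fuses per-stage class logits with softmax-normalized gate weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `softmax` and `fuse_stage_logits` are hypothetical, and the gate scores are passed in as plain numbers, whereas in a trained model they would presumably be produced by a small learned network.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_stage_logits(stage_logits, gate_scores):
    """Fuse per-stage class logits with softmax-normalized gate weights.

    stage_logits: one list of class logits per backbone stage.
    gate_scores:  one raw score per stage (assumed here to be given;
                  a real model would learn to predict them per image).
    Returns (fused_logits, stage_weights).
    """
    if len(stage_logits) != len(gate_scores):
        raise ValueError("need exactly one gate score per stage")
    weights = softmax(gate_scores)
    num_classes = len(stage_logits[0])
    fused = [0.0] * num_classes
    for weight, logits in zip(weights, stage_logits):
        for cls, value in enumerate(logits):
            fused[cls] += weight * value
    return fused, weights


# Two stages that disagree; the gate strongly favors the first stage,
# so the fused prediction follows the first stage's preferred class.
fused, weights = fuse_stage_logits([[2.0, 0.0], [0.0, 2.0]], [3.0, -3.0])
```

Because the weights are normalized per input, a stage can dominate for close-up shots and be down-weighted for distant ones, which is the intuition behind adaptive fusion across scales.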

Abstract

Given the critical role of birds in ecosystems, Fine-Grained Bird Recognition (FGBR) has gained increasing attention, particularly in distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT models hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an "Activation-Selection-Aggregation" paradigm. Specifically, we first propose a multi-scale cue activation module to ensure that the discriminative cues learned at different stages are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism for final model decisions. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely used FGBR benchmarks.
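
The token selection step can be pictured as keeping only the highest-scoring tokens at each stage. The sketch below is a minimal illustration under assumptions: the abstract does not say how tokens are scored, so the scores are taken as given (for example, attention-derived importance), and the name `select_top_k_tokens` and the fixed budget `k` are hypothetical.

```python
def select_top_k_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order.

    tokens: list of token features (any objects) for one stage.
    scores: one importance score per token (assumed to be given; the
            scoring rule used by the paper is not specified here).
    k:      number of tokens to keep (hypothetical fixed budget).
    Returns (kept_tokens, kept_indices).
    """
    if len(tokens) != len(scores):
        raise ValueError("need exactly one score per token")
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order among the survivors.
    kept_indices = sorted(ranked[:k])
    return [tokens[i] for i in kept_indices], kept_indices


# Four tokens from one stage; the two low-scoring (background-like) tokens are dropped.
kept, idx = select_top_k_tokens(["t0", "t1", "t2", "t3"], [0.1, 0.9, 0.3, 0.7], k=2)
```

Applying the same routine separately at every stage yields one reduced token set per scale, each of which can then feed its own classifier.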

Paper Structure

This paper contains 21 sections, 14 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The primary challenges in FGBR, illustrated with bird images. Panels (a) to (d) show subtle differences between species, often obscured by complex backgrounds. Panels (e) to (h) highlight significant scale variations between distant and close-up shots, which make the same body parts appear different and complicate the recognition task.
  • Figure 2: The framework of our MDCM. During the forward pass, MSCA adjusts the activation of the feature maps to ensure that the cues learned at different stages are mutually different. Subsequently, MSTS extracts diverse cues from multiple stages and filters out noisy regions. Finally, MSDA dynamically aggregates the classification results for the final model decisions.
  • Figure 3: Visualization of the MSTS mechanism, with the selected tokens marked by red rectangles.
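
The Figure 2 caption states that cues from different stages should be mutually different, but this summary does not say how MSCA enforces that. One generic way to quantify overlap between stages, shown here purely as an assumption and not as the paper's formulation, is the mean pairwise cosine similarity of flattened activation maps. The names `cosine_similarity` and `mean_pairwise_overlap` are hypothetical, and the maps are assumed to be already resized to a common grid.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_overlap(stage_maps):
    """Average cosine similarity over all pairs of stage activation maps.

    stage_maps: one flattened activation map per stage, all of the same
                length (assumed pre-resized to a shared spatial grid).
    Lower values mean the stages respond to more distinct regions.
    """
    count = len(stage_maps)
    pairs = [(i, j) for i in range(count) for j in range(i + 1, count)]
    if not pairs:
        return 0.0
    total = sum(cosine_similarity(stage_maps[i], stage_maps[j]) for i, j in pairs)
    return total / len(pairs)


# Stages attending to disjoint regions have low overlap;
# stages attending to the same region have high overlap.
disjoint = mean_pairwise_overlap([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
identical = mean_pairwise_overlap([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

A quantity like this could serve as a diagnostic, or as a penalty term during training, for checking whether stages attend to distinct regions.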