M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition

Jiyong Moon, Junseok Lee, Yunju Lee, Seongsik Park

TL;DR

This work proposes multi-scale patch selection (MSPS), which encourages richer object representations based on feature hierarchy and consistently improves performance from small-sized to large-sized objects. The resulting model, M2Former, outperforms CNN-/ViT-based models on several widely used FGVR benchmarks.

Abstract

Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). ViT can effectively model the interdependencies between patch-divided object regions through an inherent self-attention mechanism. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and exacerbate vulnerability to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions. Compared to previous single-scale patch selection (SSPS), our proposed MSPS encourages richer object representations based on feature hierarchy and consistently improves performance from small-sized to large-sized objects. As a result, we propose M2Former, which outperforms CNN-/ViT-based models on several widely used FGVR benchmarks.

Paper Structure

This paper contains 29 sections, 16 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Comparison between previous single-scale patch selection (SSPS) and our multi-scale patch selection (MSPS). (a) SSPS extracts salient image patches of the same size, and the limited receptive field causes suboptimal object representation and vulnerability to scale variance. (b) In contrast, our MSPS extracts salient patches at multiple scales. This encourages rich representations of objects, from deep semantic information in large-sized patches to fine-grained details in small-sized patches. In addition, the flexibility of multi-scale patches is useful for handling extremely large or small objects through multiple receptive fields.
  • Figure 2: The framework of our M2Former. MSPS conducts patch selection at each stage of the MViT backbone. For each intermediate feature map, several salient patches are selected based on score maps computed using mean activation. At the same time, the global CLS token is transferred to each stage, and the transferred CLS tokens are concatenated with the selected patch sequence. The patch sequences are then passed through MSCA blocks consisting of CCA and SCA. Finally, the CLS tokens are detached from the patch sequence of each stage, and the final prediction is made by aggregating them.
  • Figure 3: CCA and SCA constituting the MSCA block. (a) CCA recalibrates the channels of the stage-specific patches based on their cross-scale channel interdependencies. (b) For the same purpose, SCA captures spatial-wise interdependencies of the selected multi-scale patches.
  • Figure 4: Example images of objects at different scales. Following COCO ETC10, we classify objects into three scales (large, medium, and small) according to their bounding box size. The first, second, and third rows show example images belonging to the large (ob$_{l}$), medium (ob$_{m}$), and small (ob$_{s}$) categories, respectively.
  • Figure 5: Visualization results of the selected patches when MSPS was conducted at each stage. In each subfigure, the first column shows the original image, and the second to fifth columns show the patches selected from stage-4 to stage-1. The selected patches are marked with red rectangles.
  • ...and 1 more figures
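
The Figure 2 caption says that, at each stage, salient patches are selected from a score map computed using mean activation. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes the score of a patch is the mean of its channel activations and that each stage keeps a fixed number of top-scoring patches. The function name `select_salient_patches`, the toy feature values, and the per-stage counts in `stage_k` are all hypothetical.

```python
from statistics import fmean
from typing import Dict, List, Sequence


def select_salient_patches(patches: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k patches with the highest mean activation.

    `patches` holds one feature vector (channel activations) per patch.
    Assumption: the score map is the per-patch mean over channels.
    """
    if not 0 < k <= len(patches):
        raise ValueError("k must be between 1 and the number of patches")
    # Score each patch by the mean of its channel activations.
    scores = [fmean(p) for p in patches]
    # Rank patch indices by score, highest first, and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical toy features: an earlier (finer) stage has more patches
# than a later (coarser) stage.
stage_features: Dict[str, List[List[float]]] = {
    "stage1": [[0.1, 0.2], [0.9, 0.7], [0.3, 0.1], [0.6, 0.8]],
    "stage2": [[0.5, 0.5], [0.2, 0.1]],
}
stage_k = {"stage1": 2, "stage2": 1}  # hypothetical number of patches kept per stage

selected = {
    name: select_salient_patches(feats, stage_k[name])
    for name, feats in stage_features.items()
}
print(selected)  # {'stage1': [1, 3], 'stage2': [0]}
```

In this sketch each stage is scored independently, so the selected patches differ in scale simply because each stage's feature map has a different resolution. The CLS token transfer and MSCA blocks described in the caption are not modeled here.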