Enhancing Fine-grained Image Classification through Attentive Batch Training

Duy M. Le, Bao Q. Bui, Anh Tran, Cong Tran, Cuong Pham

TL;DR

This work tackles fine-grained image classification by introducing Relationship Batch Integration (RBI), a batch-aware framework that exploits inter-image relationships within a training batch. RBI combines a Relationship Position Encoding (RPE) module, which encodes pairwise image similarities based on normalized PSNR-derived metrics, with Residual Relationship Attention (RRA) to fuse batch features and preserve original representations via a residual pathway. Empirical results across multiple backbones and datasets (including CUB-200-2011, Stanford Dogs, and NABirds) show consistent accuracy gains, with state-of-the-art performance on Stanford Dogs and notable improvements on others, while enabling smaller backbones to outperform larger baselines in some configurations. The approach is presented as a versatile plug-in refinement that can be integrated with existing networks to boost fine-grained recognition without substantial computational overhead.
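As a rough illustration of the PSNR-based relationship signal mentioned above, the sketch below computes pairwise PSNR for a small batch and rescales it to [0, 1]. It is not the authors' implementation: the function names `psnr` and `relationship_matrix`, the min-max normalization, and the choice to score identical images as 1.0 are all assumptions made for this example.

```python
import math


def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio between two equally sized flat pixel lists."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_val ** 2 / mse)


def relationship_matrix(images, max_val=255.0):
    """Pairwise PSNR between batch images, rescaled to [0, 1].

    Assumption (not from the paper): finite PSNR values are min-max
    normalized over the batch, and identical pairs (including the
    diagonal) are assigned the maximum similarity of 1.0.
    """
    n = len(images)
    raw = [[psnr(images[i], images[j], max_val) for j in range(n)]
           for i in range(n)]
    finite = [v for row in raw for v in row if math.isfinite(v)]
    lo = min(finite) if finite else 0.0
    hi = max(finite) if finite else 0.0
    span = hi - lo

    def scale(v):
        if not math.isfinite(v):
            return 1.0
        return (v - lo) / span if span > 0 else 0.0

    return [[scale(v) for v in row] for row in raw]


# Toy batch of three 4-pixel "images" (hypothetical data).
batch = [[0, 0, 0, 0], [10, 10, 10, 10], [200, 200, 200, 200]]
R = relationship_matrix(batch)
```

In this toy batch the first two images are closest in pixel space, so their entry receives the largest normalized score, while the first and third receive the smallest.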

Abstract

Fine-grained image classification, which is a challenging task in computer vision, requires precise differentiation among visually similar object categories. In this paper, we propose 1) a novel module called Residual Relationship Attention (RRA) that leverages the relationships between images within each training batch to effectively integrate visual feature vectors of batch images and 2) a novel technique called Relationship Position Encoding (RPE), which encodes the positions of relationships between original images in a batch and effectively preserves the relationship information between images within the batch. Additionally, we design a novel framework, namely Relationship Batch Integration (RBI), which utilizes RRA in conjunction with RPE, allowing the discernment of vital visual features that may remain elusive when examining a single image representative of a particular class. Through extensive experiments, our proposed method demonstrates significant improvements in the accuracy of different fine-grained classifiers, with average increases of $(+2.78\%)$ and $(+3.83\%)$ on the CUB-200-2011 and Stanford Dogs datasets, respectively, while achieving a state-of-the-art result $(95.79\%)$ on the Stanford Dogs dataset. Although the gains are smaller than in fine-grained image classification, our method also benefits general image classification, attaining a state-of-the-art result of $(93.71\%)$ on the Tiny-ImageNet dataset. Furthermore, our method serves as a plug-in refinement module and can be easily integrated into different networks.
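To make the idea of fusing batch features through a residual pathway more concrete, here is a minimal sketch under stated assumptions. It is not the paper's RRA module: it uses no learned projections, and adding the relationship scores as a bias on the attention logits is an assumption made only for illustration. The name `residual_batch_attention` and its parameters are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def residual_batch_attention(features, relation):
    """Fuse each image's feature vector with the rest of the batch.

    features: list of n feature vectors, each of length d.
    relation: n x n matrix of relationship scores (assumed here to act
              as an additive bias on the attention logits).
    Returns features[i] + attention-weighted mix of all batch features,
    so the original representation is kept via the residual term.
    """
    n, d = len(features), len(features[0])
    scale = math.sqrt(d)
    fused = []
    for i in range(n):
        # Scaled dot-product similarity plus the relationship bias.
        logits = [
            sum(a * b for a, b in zip(features[i], features[j])) / scale
            + relation[i][j]
            for j in range(n)
        ]
        weights = softmax(logits)
        mixed = [sum(weights[j] * features[j][k] for j in range(n))
                 for k in range(d)]
        fused.append([features[i][k] + mixed[k] for k in range(d)])
    return fused


# Hypothetical two-image batch with 2-dimensional features.
feats = [[1.0, 0.0], [0.0, 1.0]]
rel = [[0.0, 0.0], [0.0, 0.0]]
out = residual_batch_attention(feats, rel)
```

Because the attention weights sum to one, each output is the original feature plus a convex combination of the batch features; with a batch of one image the output is simply twice the input feature.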

Paper Structure

This paper contains 15 sections, 13 equations, 8 figures, and 1 table.

Figures (8)

  • Figure 1: Example of intra-batch feature fusion to enhance predictivity for target images.
  • Figure 2: Relationship Batch Integration (RBI) Framework
  • Figure 3: Performance comparison of RBI models using various batch sizes on the Stanford Dogs dataset (left) and the CUB-200-2011 dataset (right). Note that experiments with large batch sizes on Densenet201-RBI, SwinT-Small-RBI, and ConvNeXtBase-RBI are omitted due to GPU memory constraints.
  • Figure 4: Comparison between features extracted by ConvNeXt-Large, ConvNeXt-Large-RBI, HERB-SwinT, and HERB-SwinT-RBI on the Stanford Dogs dataset, illustrated by GradCAM.
  • Figure 5: Flow chart of the GradCAM visualizations of features extracted by ConvNeXt-Large-RBI within a batch containing 8 images.
  • ...and 3 more figures