Local and Global Context-and-Object-part-Aware Superpixel-based Data Augmentation for Deep Visual Recognition

Fadi Dornaika, Danyang Sun

TL;DR

LGCOAMix addresses limitations of traditional cutmix by introducing a superpixel-based grid mixing mechanism and a semantic, attention-guided label mixing strategy that preserves object-part information with a single forward pass. The approach combines superpixel pooling, self-attention, and discriminative superpixel selection to enable both global classification improvements and targeted local learning, complemented by cross-image contrastive supervision. Empirical results across diverse datasets and backbones show consistent improvements over state-of-the-art cutmix variants and even strong WSOL performance, with demonstrated applicability to both CNN and Transformer architectures. The method achieves strong generalization with efficient inference, making it practical for real-world visual recognition tasks.

Abstract

Cutmix-based data augmentation, which uses a cut-and-paste strategy, has shown remarkable generalization capabilities in deep learning. However, existing methods primarily consider global semantics with image-level constraints, which excessively reduces attention to the discriminative local context of the class and leads to a performance improvement bottleneck. Moreover, existing methods for generating augmented samples usually involve cutting and pasting rectangular or square regions, resulting in a loss of object part information. To mitigate the problem of inconsistency between the augmented image and the generated mixed label, existing methods usually require double forward propagation or rely on an external pre-trained network for object centering, which is inefficient. To overcome the above limitations, we propose LGCOAMix, an efficient context-aware and object-part-aware superpixel-based grid blending method for data augmentation. To the best of our knowledge, this is the first time that a label mixing strategy using a superpixel attention approach has been proposed for cutmix-based data augmentation. It is also the first instance of learning local features from discriminative superpixel-wise regions and cross-image superpixel contrasts. Extensive experiments on various benchmark datasets show that LGCOAMix outperforms state-of-the-art cutmix-based data augmentation methods on classification tasks and on weakly supervised object localization on CUB200-2011. We have demonstrated the effectiveness of LGCOAMix not only for CNN networks, but also for Transformer networks. Source code is available at https://github.com/DanielaPlusPlus/LGCOAMix.
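To make the contrast with rectangular cut-and-paste concrete, the following is a minimal NumPy sketch of superpixel-wise mixing with attention-weighted label mixing. It is an illustration under stated assumptions, not the authors' implementation: the function names `superpixel_mix` and `attention_label_weights`, the `select_prob` parameter, and the use of a per-pixel attention map are all hypothetical, and the superpixel map is assumed to come from any external segmentation step.

```python
import numpy as np


def superpixel_mix(img_a, img_b, segments_b, select_prob=0.5, rng=None):
    """Paste randomly chosen superpixels of img_b onto img_a.

    segments_b is an (H, W) integer map assigning every pixel of img_b to a
    superpixel (assumed to be computed beforehand by any superpixel method).
    Returns the mixed image and a boolean (H, W) mask of the pasted pixels.
    """
    rng = np.random.default_rng() if rng is None else rng
    ids = np.unique(segments_b)
    # Hypothetical selection rule: keep each superpixel with probability select_prob.
    chosen = ids[rng.random(ids.size) < select_prob]
    mask = np.isin(segments_b, chosen)
    # Regions follow superpixel boundaries instead of a rectangle.
    mixed = np.where(mask[..., None], img_b, img_a)
    return mixed, mask


def attention_label_weights(attn, mask):
    """Split the attention mass between the two source images.

    attn is a non-negative (H, W) attention map over the mixed image
    (an assumption of this sketch). With uniform attention the result
    reduces to ordinary area-based label mixing.
    """
    total = attn.sum()
    if total == 0:
        lam_b = float(mask.mean())  # fall back to area-based mixing
    else:
        lam_b = float(attn[mask].sum() / total)
    return 1.0 - lam_b, lam_b
```

The mixed label would then be `lam_a * y_a + lam_b * y_b` for one-hot labels `y_a` and `y_b`. Computing the weights from attention produced in the same forward pass is what avoids a second pass or an external network in this sketch.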

Paper Structure

This paper contains 27 sections, 12 equations, 9 figures, 11 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparison of augmented samples and label mixing methods. (a) LGCOAMix generates local object-part preserving augmented samples with superpixel-attention-based label mixing with a single forward propagation, which is more semantic and efficient than area-based label mixing. (b) park2022saliency uses saliency-based label mixing, but local object part information is lost because the mixing is in square form. (c) dornaika2023object and (d) walawalkar2020attentive use area-based label mixing with object centering. (c) dornaika2023object requires double forward propagation. (d) walawalkar2020attentive requires an external pre-trained network. In (d), the local object part information is also lost. (e) yun2019cutmix encounters inconsistencies between the augmented image and the generated mixed label and loses the local object part information.
  • Figure 2: The overall framework of our LGCOAMix method.
  • Figure 3: (a) Superpixel pooling and self-attention aim to capture local contextual and object-part information; (b) The detailed architecture of superpixel pooling and self-attention, followed by selection.
  • Figure 4: Studies of the loss weights (the quantitative performance improvements are reported in Table \ref{table9}). (a) Accuracy on CIFAR100 with a ResNet18 encoder, with $\gamma_{2}=0.05$ fixed while selecting $\gamma_{1}$ and $\gamma_{1}=0.1$ fixed while selecting $\gamma_{2}$; (b) Training loss over epochs on CIFAR100 with a ResNet18 encoder for $\gamma_{1}=0.1$, $\gamma_{2}=0.05$.
  • Figure 5: (a)(b) Source images of size $(224,224)$; (f)(g) Source images of size $(64,64)$; (c)-(j) are the augmented images with different input number of superpixels for source images. Note that the actual number of superpixels is not precisely equal to the input number of superpixels.
  • ...and 4 more figures
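
The superpixel pooling, self-attention, and selection stages named in the Figure 3 caption can be sketched in NumPy as follows. This is only an illustrative sketch under simplifying assumptions: features are average-pooled per superpixel, the attention has no learned query/key/value projections, and superpixels are ranked by the mean attention they receive. The function names `superpixel_pool`, `self_attention`, and `top_k_superpixels` are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def superpixel_pool(features, segments):
    """Average an (H, W, C) feature map over each superpixel of an (H, W) map."""
    ids, inverse = np.unique(segments, return_inverse=True)
    inverse = inverse.reshape(-1)  # flat index of each pixel's superpixel
    flat = features.reshape(-1, features.shape[-1])
    sums = np.zeros((ids.size, flat.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, flat)  # accumulate features per superpixel
    counts = np.bincount(inverse, minlength=ids.size)[:, None]
    return sums / counts, ids


def self_attention(tokens):
    """Scaled dot-product self-attention without learned projections (assumption)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights


def top_k_superpixels(weights, ids, k):
    """Rank superpixels by the mean attention they receive and keep the top k."""
    received = weights.mean(axis=0)
    order = np.argsort(received)[::-1][:k]
    return ids[order]
```

In a training pipeline, the selected superpixel tokens would feed the local classification and cross-image contrastive objectives. Here the selection criterion is just one plausible stand-in for "discriminative superpixel selection".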