ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation

Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel

TL;DR

ForAug tackles biases and generalization gaps in Vision Transformer training by separating foreground objects from their backgrounds and recombining them with diverse backgrounds. The approach has two stages: offline segmentation with background inpainting, and online recombination with controlled object size and placement. It improves accuracy by up to 4.5 percentage points (p.p.) on ImageNet and by up to 7.3 p.p. on downstream tasks. Beyond performance, ForAug introduces fine-grained bias metrics (background robustness, foreground focus, center bias, size bias) and demonstrates substantial bias reduction, positioning it as both a training technique and a diagnostic tool. The work provides open-source code and precomputed segmentation outputs to facilitate reproducibility and further research across CV tasks.

Abstract

Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases, such as center or size bias, that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation operation that addresses these challenges by explicitly imposing invariances into the training data, which are otherwise part of the neural network architecture. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds. This recombination step enables us to take fine-grained control over object position and size, as well as background selection. We demonstrate that using ForAug significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet, which translates to 7.3 p.p. on downstream tasks. Importantly, ForAug not only improves accuracy but also opens new ways to analyze model behavior and quantify biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that using ForAug during training substantially reduces these biases. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at https://github.com/tobna/ForAug.
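The recombination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `resize_nearest` and `recombine`, the nearest-neighbour resizing, and the uniformly random placement are all assumptions made for illustration. It assumes the offline stage has already produced a foreground crop, its soft mask, and an inpainted background.

```python
import numpy as np


def resize_nearest(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return img[rows[:, None], cols]


def recombine(fg_rgb, fg_mask, background, scale, rng):
    """Paste a masked foreground onto a background (hypothetical helper).

    fg_rgb:     (h, w, 3) float array, the foreground crop
    fg_mask:    (h, w) float array in [0, 1], 1 where the object is
    background: (H, W, 3) float array, assumed to be already inpainted
    scale:      size factor applied to the foreground
    rng:        a numpy random Generator used to pick the position
    """
    H, W = background.shape[:2]
    h, w = fg_mask.shape
    # Scale the foreground, clamped so that it still fits in the background.
    new_h = max(1, min(H, round(h * scale)))
    new_w = max(1, min(W, round(w * scale)))
    fg = resize_nearest(fg_rgb, new_h, new_w)
    mask = resize_nearest(fg_mask, new_h, new_w)[..., None]
    # Assumption: the top-left corner is drawn uniformly at random.
    top = int(rng.integers(0, H - new_h + 1))
    left = int(rng.integers(0, W - new_w + 1))
    out = background.copy()
    region = out[top:top + new_h, left:left + new_w]
    # Alpha-blend: object pixels where mask is 1, background where it is 0.
    out[top:top + new_h, left:left + new_w] = mask * fg + (1 - mask) * region
    return out, (top, left, new_h, new_w)
```

Because position and scale are explicit arguments here, the same routine could be reused at evaluation time to probe center and size bias by sweeping them instead of sampling them.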

Paper Structure

This paper contains 20 sections, 5 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Comparison of traditional image classification training and training when using ForAug. ForAug recombines foreground objects with different backgrounds each epoch, thus creating a more diverse training set. We still apply strong traditional data augmentation afterwards.
  • Figure 2: Overview of ForAug. The data creation consists of two stages: segmentation (offline, \ref{sec:segmentation}), where we segment the foreground objects from the background and fill in the background, and recombination (online, \ref{sec:recombination}), where we combine the foreground objects with different backgrounds to create new samples. After recombination, we apply strong, commonly used augmentation policies.
  • Figure 3: Evaluation of background robustness on ImageNet + ForAug, ImageNet9 and CounterAnimal. We plot the in-distribution (top of arrow) and the out-of-distribution (bottom of arrow) accuracy when training with and without ForAug. We annotate each arrow with its length $\Delta$. Training with ForAug improves the background robustness of all transformers by mostly boosting the out-of-distribution accuracy.
  • Figure 4: Evaluation of the foreground focus (\ref{eq:fg-focus}) using GradCam, GradCam++ and IntegratedGradients (IG) of models trained on ImageNet. Training with ForAug improves the foreground focus of almost all models.
  • Figure 5: Evaluation of the size bias of models trained on ImageNet. We plot the accuracy relative to the accuracy when using the default size ($f_\text{size} = 1.0$).
  • ...and 2 more figures
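
The quantities plotted in Figures 3 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, "relative accuracy" is assumed here to be a ratio (the caption does not say whether it is a ratio or a difference), and any numbers passed in are the caller's own.

```python
def robustness_gap(acc_in_dist: float, acc_out_dist: float) -> float:
    """Arrow length in Figure 3: in-distribution minus out-of-distribution accuracy."""
    return acc_in_dist - acc_out_dist


def relative_accuracy(acc_by_size: dict, default_size: float = 1.0) -> dict:
    """Accuracy at each size factor relative to the default size.

    Assumption: "relative" means a ratio to the accuracy at f_size = 1.0.
    """
    base = acc_by_size[default_size]
    return {size: acc / base for size, acc in acc_by_size.items()}
```

A smaller gap means the model depends less on the background, and a relative-accuracy curve that stays close to 1.0 across size factors indicates a weaker size bias.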