ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel
TL;DR
ForAug tackles biases and generalization gaps in Vision Transformer training by explicitly separating foreground objects from their backgrounds and recombining them with diverse backgrounds. The approach has two stages: offline segmentation with inpainting, followed by online recombination with controlled object size and placement. It improves accuracy by up to 4.5 percentage points (p.p.) on ImageNet, which translates to 7.3 p.p. on downstream tasks. Beyond performance, ForAug introduces fine-grained bias metrics (background robustness, foreground focus, center bias, size bias) and demonstrates substantial bias reduction, which makes it both a training technique and a diagnostic tool. The work provides open-source code and precomputed segmentation outputs to support reproducibility and further research across computer vision tasks.
Abstract
Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases, such as center or size bias, that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation operation that addresses these challenges by explicitly encoding in the training data invariances that are otherwise built into the neural network architecture. ForAug is constructed by using pretrained foundation models to separate foreground objects from their backgrounds and recombine them with different backgrounds. This recombination step gives us fine-grained control over object position and size, as well as background selection. We demonstrate that using ForAug significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet, which translates to 7.3 p.p. on downstream tasks. Importantly, ForAug not only improves accuracy but also opens new ways to analyze model behavior and quantify biases. Specifically, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that using ForAug during training substantially reduces these biases. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at https://github.com/tobna/ForAug.
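The online recombination step can be illustrated with a minimal sketch. This is not the authors' implementation (the linked repository is the reference for that). The function names `resize_nearest` and `recombine`, the parameters `rel_size` and `center`, the nearest-neighbour resizing, and the uniform sampling ranges are all assumptions made for illustration. The sketch also assumes the offline stage has already produced a foreground crop, a soft mask for it, and an inpainted background.

```python
import numpy as np


def resize_nearest(arr, new_h, new_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    h, w = arr.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return arr[rows][:, cols]


def recombine(fg, mask, bg, rel_size, center):
    """Paste a masked foreground onto a background (hypothetical helper).

    fg:       (H, W, 3) foreground crop
    mask:     (H, W) soft mask in [0, 1], 1 where the object is
    bg:       (Hb, Wb, 3) background image
    rel_size: longer side of the object relative to the shorter background side
    center:   (y, x) object centre in relative coordinates, each in [0, 1]
    """
    bg_h, bg_w = bg.shape[:2]
    fg_h, fg_w = fg.shape[:2]

    # Rescale the foreground to the requested relative size.
    factor = rel_size * min(bg_h, bg_w) / max(fg_h, fg_w)
    new_h = max(1, int(round(fg_h * factor)))
    new_w = max(1, int(round(fg_w * factor)))
    fg_r = resize_nearest(fg, new_h, new_w)
    mask_r = resize_nearest(mask, new_h, new_w)

    # Top-left corner that puts the object centre at the requested position.
    top = int(round(center[0] * bg_h - new_h / 2))
    left = int(round(center[1] * bg_w - new_w / 2))

    # Clip the paste window to the background bounds.
    y0, y1 = max(top, 0), min(top + new_h, bg_h)
    x0, x1 = max(left, 0), min(left + new_w, bg_w)
    if y0 >= y1 or x0 >= x1:
        return bg.copy()

    out = bg.astype(np.float32)
    fg_crop = fg_r[y0 - top:y1 - top, x0 - left:x1 - left].astype(np.float32)
    alpha = mask_r[y0 - top:y1 - top, x0 - left:x1 - left, None].astype(np.float32)

    # Alpha-blend the object over the background inside the window.
    out[y0:y1, x0:x1] = alpha * fg_crop + (1.0 - alpha) * out[y0:y1, x0:x1]
    return out.astype(bg.dtype)


# Usage with random stand-in data: sample size and position per training step.
rng = np.random.default_rng(0)
fg = rng.integers(0, 256, size=(32, 24, 3), dtype=np.uint8)
mask = np.ones((32, 24), dtype=np.float32)
bg = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
image = recombine(fg, mask, bg,
                  rel_size=rng.uniform(0.2, 0.8),
                  center=rng.uniform(0.0, 1.0, size=2))
```

Because size and position are explicit arguments, the same helper could in principle be used for evaluation as well as training, for example by sweeping `center` over a grid to probe center bias or sweeping `rel_size` to probe size bias. How the paper actually defines those metrics is not stated in this section.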
