SAFLEX: Self-Adaptive Augmentation via Feature Label Extrapolation
Mucong Ding, Bang An, Yuancheng Xu, Anirudh Satheesh, Furong Huang
TL;DR
SAFLEX tackles the noise and label errors introduced by upstream data augmentation pipelines by learning per-sample weights and soft labels for augmented instances through an efficient gradient-matching bilevel framework. By treating augmentation as a low-dimensional refinement task and solving it via an online greedy approximation, SAFLEX remains a plug-in to existing pipelines and scales across medical imaging, tabular data, diffusion-based augmentations, and contrastive CLIP fine-tuning, achieving consistent gains with modest overhead. Its key contributions are a low-dimensional augmentation parametrization with a principled bilevel optimization approach, compatibility with diverse upstream methods, and extensive empirical validation showing improvements such as a $1.2\%$ average gain across experiments. Overall, SAFLEX enables adapting current augmentation pipelines to new data types and tasks, promoting more robust, data-centric training without designing transformations from scratch.
Abstract
Data augmentation, a cornerstone technique in deep learning, is crucial in enhancing model performance, especially with scarce labeled data. While traditional techniques are effective, their reliance on hand-crafted methods limits their applicability across diverse data types and tasks. Although modern learnable augmentation methods offer increased adaptability, they are computationally expensive and challenging to incorporate within prevalent augmentation workflows. In this work, we present a novel, efficient method for data augmentation that bridges the gap between existing augmentation strategies and emerging datasets and learning tasks. We introduce SAFLEX (Self-Adaptive Augmentation via Feature Label EXtrapolation), which learns the sample weights and soft labels of augmented samples provided by any given upstream augmentation pipeline, using a specifically designed, efficient bilevel optimization algorithm. Remarkably, SAFLEX reduces the noise and label errors of the upstream augmentation pipeline at marginal computational cost. As a versatile module, SAFLEX excels across diverse datasets, including natural and medical images and tabular data, and performs well in few-shot learning and out-of-distribution generalization. SAFLEX integrates with common augmentation strategies like RandAug and CutMix, and with augmentations from large pre-trained generative models like Stable Diffusion, and is also compatible with frameworks such as CLIP fine-tuning. Our findings highlight the potential to adapt existing augmentation pipelines for new data types and tasks, signaling a move towards more adaptable and resilient training frameworks.
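To make the role of the learned quantities concrete, the following is a minimal NumPy sketch of how per-sample weights and soft labels could enter a training loss on augmented samples. It is an illustration under stated assumptions, not the authors' implementation: the function names `log_softmax` and `weighted_soft_cross_entropy` are hypothetical, the loss is assumed to be a weight-normalized soft-label cross-entropy, and the bilevel procedure that actually learns the weights and soft labels is not shown.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis (axis=1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def weighted_soft_cross_entropy(logits, soft_labels, weights):
    """Weight-normalized cross-entropy against soft labels (assumed form).

    logits:      (n, k) model outputs on n augmented samples
    soft_labels: (n, k) rows are probability vectors over k classes
    weights:     (n,)   non-negative per-sample weights, not all zero
    """
    per_sample = -(soft_labels * log_softmax(logits)).sum(axis=1)
    return float((weights * per_sample).sum() / weights.sum())


# Two augmented samples; the second carries a label that disagrees with the
# model's prediction, standing in for an augmentation-induced label error.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
labels = np.eye(3)[[0, 2]]

uniform_loss = weighted_soft_cross_entropy(logits, labels, np.array([1.0, 1.0]))
downweighted_loss = weighted_soft_cross_entropy(logits, labels, np.array([1.0, 0.1]))
print(uniform_loss, downweighted_loss)
```

In this toy setup, lowering the weight of the second sample reduces its influence on the averaged loss. Replacing its one-hot label with a softer distribution would have a similar damping effect.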
