SAFLEX: Self-Adaptive Augmentation via Feature Label Extrapolation

Mucong Ding, Bang An, Yuancheng Xu, Anirudh Satheesh, Furong Huang

TL;DR

SAFLEX addresses the noise and label errors introduced by upstream data augmentation pipelines. It learns per-sample weights and soft labels for augmented instances through an efficient gradient-matching bilevel framework. By treating augmentation as a low-dimensional refinement task and solving it with an online greedy approximation, SAFLEX remains a plug-in for existing pipelines and scales across medical imaging, tabular data, diffusion-based augmentations, and contrastive CLIP fine-tuning, achieving consistent gains with modest overhead. Its key contributions are a novel low-dimensional augmentation parametrization with a principled bilevel optimization approach, compatibility with diverse upstream methods, and extensive empirical validation showing improvements such as a $1.2\%$ average gain across experiments. Overall, SAFLEX makes it possible to adapt current augmentation pipelines to new data types and tasks, promoting more robust, data-centric training without designing transformations from scratch.

Abstract

Data augmentation, a cornerstone technique in deep learning, is crucial in enhancing model performance, especially with scarce labeled data. While traditional techniques are effective, their reliance on hand-crafted methods limits their applicability across diverse data types and tasks. Although modern learnable augmentation methods offer increased adaptability, they are computationally expensive and challenging to incorporate within prevalent augmentation workflows. In this work, we present a novel, efficient method for data augmentation, effectively bridging the gap between existing augmentation strategies and emerging datasets and learning tasks. We introduce SAFLEX (Self-Adaptive Augmentation via Feature Label EXtrapolation), which learns the sample weights and soft labels of augmented samples provided by any given upstream augmentation pipeline, using a specifically designed efficient bilevel optimization algorithm. Remarkably, SAFLEX effectively reduces the noise and label errors of the upstream augmentation pipeline with a marginal computational cost. As a versatile module, SAFLEX excels across diverse datasets, including natural and medical images and tabular data, showcasing its prowess in few-shot learning and out-of-distribution generalization. SAFLEX seamlessly integrates with common augmentation strategies like RandAug, CutMix, and those from large pre-trained generative models like stable diffusion and is also compatible with frameworks such as CLIP's fine-tuning. Our findings highlight the potential to adapt existing augmentation pipelines for new data types and tasks, signaling a move towards more adaptable and resilient training frameworks.

Paper Structure

This paper contains 9 sections, 1 theorem, 14 equations, 3 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

The approximated soft label solution is $\mathbf{y}=\text{OneHot}\left(\arg\max_{k}[\mathbf{\Pi}]_k\right)$, where $\text{OneHot}(\cdot)$ denotes one-hot encoding, and the sample weight solution is $w=1$ if $\sum_{k=1}^K [\mathbf{\Pi}]_k \geq 0$; otherwise, $w=0$.
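
The closed form in Theorem 1 can be illustrated with a short sketch. It assumes that $\mathbf{\Pi}$ is already available as a length-$K$ vector of scores for one augmented sample; how $\mathbf{\Pi}$ is computed is not shown in this section, so the function below simply takes it as input. The name `closed_form_solution` and the tie-breaking rule (first maximal index) are illustrative assumptions, not part of the paper.

```python
from typing import List, Sequence, Tuple


def closed_form_solution(pi: Sequence[float]) -> Tuple[List[float], int]:
    """Apply the Theorem 1 rule to one augmented sample.

    `pi` stands in for the vector Pi from the theorem (assumed precomputed).
    Returns the one-hot soft label y and the binary sample weight w.
    """
    if len(pi) == 0:
        raise ValueError("pi must contain at least one class score")

    # y = OneHot(argmax_k [Pi]_k); ties resolve to the first maximal index
    # (an assumption, the theorem does not specify tie-breaking).
    k_star = max(range(len(pi)), key=lambda k: pi[k])
    y = [1.0 if k == k_star else 0.0 for k in range(len(pi))]

    # w = 1 if sum_k [Pi]_k >= 0, otherwise w = 0.
    w = 1 if sum(pi) >= 0 else 0
    return y, w


if __name__ == "__main__":
    print(closed_form_solution([0.2, -0.5, 1.0]))
    print(closed_form_solution([-1.0, -2.0, 0.5]))
```

In the first call the entries sum to a non-negative value, so the sample is kept with weight 1; in the second the sum is negative, so the sample is dropped (weight 0) even though its arg max class is the same.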

Figures (3)

  • Figure 1: SAFLEX learns to adjust the sample weights and soft labels of augmented samples from an upstream pipeline, aiming to maximize the model's performance on the validation set. While formulated as a bilevel optimization problem, it can be solved efficiently by linear programming with a gradient-matching objective. SAFLEX is a plug-in to the existing training framework.
  • Figure 2: (a) Under-augmentation can lead to a scarcity of hard positives, while over-augmentation can introduce an excess of false positives. Reducing the noise in augmentation helps resolve this dilemma. (b) Adjusting sample weights and recalibrating soft labels can address the two types of noise introduced by the augmentation process.
  • Figure 3: SAFLEX (Cross-Entropy Loss, Single batch).
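
The captions above describe sample weights and soft labels being fed back into training. As a rough illustration of how a cross-entropy loss could consume them, the sketch below computes a weighted soft-label cross-entropy for a single augmented sample. The helper names `log_softmax` and `weighted_soft_cross_entropy` are hypothetical, and this generic form is an assumption; the exact loss used in the paper is not given in this section.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def weighted_soft_cross_entropy(
    logits: Sequence[float], soft_label: Sequence[float], weight: float
) -> float:
    """Compute weight * CE(soft_label, softmax(logits)) for one sample.

    A weight of 0 removes the sample from the loss; a soft label replaces
    the usual hard one-hot target.
    """
    if len(logits) != len(soft_label):
        raise ValueError("logits and soft_label must have the same length")
    log_probs = log_softmax(logits)
    cross_entropy = -sum(y * lp for y, lp in zip(soft_label, log_probs))
    return weight * cross_entropy


if __name__ == "__main__":
    # Uniform logits over two classes with a one-hot label give log(2).
    print(weighted_soft_cross_entropy([0.0, 0.0], [1.0, 0.0], 1.0))
```

Because the weight multiplies each sample's loss, a sample with weight 0 contributes nothing to the gradient, which is how down-weighting filters out noisy augmentations in this sketch.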

Theorems & Definitions (1)

  • Theorem 1: Solution of \ref{eq:greedy}