RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks

Seong-Hyeon Hwang, Minsu Kim, Steven Euijong Whang

TL;DR

RC-Mixup tackles noisy data in regression by tightly integrating C-Mixup with multi-round robust training, creating a data-centric augmentation loop that iteratively cleans data and refines augmented samples. The method interleaves data cleaning, selective mixing, and model updates, and tunes the C-Mixup bandwidth dynamically as the data becomes progressively cleaner. Empirical results on synthetic and real datasets show substantial gains over C-Mixup and standalone robust-training baselines, and the method is compatible with robust training methods such as O2U-Net and SELFIE. The work provides a practical, generalizable approach to noise-robust regression that can be deployed alongside existing robust-training frameworks.

Abstract

We study the problem of robust data augmentation for regression tasks in the presence of noisy data. Data augmentation is essential for generalizing deep learning models, but most techniques, such as the popular Mixup, are primarily designed for classification tasks on image data. Recently, Mixup techniques specialized to regression tasks, such as C-Mixup, have also been proposed. In comparison to Mixup, which takes linear interpolations of pairs of samples, C-Mixup is more selective about which samples to mix, choosing them based on their label distances for better regression performance. However, C-Mixup does not distinguish noisy from clean samples, which can be problematic when mixing and lead to suboptimal model performance. At the same time, robust training, where the goal is to train accurate models against noisy data through multiple rounds of model training, has been heavily studied. We thus propose our data augmentation strategy RC-Mixup, which tightly integrates C-Mixup with multi-round robust training methods for a synergistic effect. In particular, C-Mixup improves robust training in identifying clean data, while robust training provides cleaner data to C-Mixup for it to perform better. A key advantage of RC-Mixup is that it is data-centric: the robust model training algorithm itself does not need to be modified, but can simply benefit from data mixing. We show in our experiments that RC-Mixup significantly outperforms C-Mixup and robust training baselines on noisy data benchmarks and can be integrated with various robust training methods.
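
To make the contrast between Mixup and C-Mixup concrete, the following is a minimal NumPy sketch of label-distance-based pair selection. It assumes a Gaussian kernel over squared label distances controlled by a bandwidth, and mixing ratios drawn from a Beta distribution; the function names `cmixup_sampling_probs` and `cmixup_batch` and the default `alpha` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cmixup_sampling_probs(y, bandwidth):
    """Row i holds the probability of picking each sample j as the mixing partner of i.

    Assumption: a Gaussian kernel on squared label distance, so samples with
    closer labels are more likely to be mixed (i itself is included).
    """
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    # Pairwise squared label distances, shape (n, n).
    sq_dist = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return weights / weights.sum(axis=1, keepdims=True)

def cmixup_batch(x, y, bandwidth, alpha=2.0, rng=None):
    """Return one mixed sample per original sample."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    probs = cmixup_sampling_probs(y, bandwidth)
    # Sample a partner index for every sample according to its row of probs.
    partners = np.array([rng.choice(len(y), p=p) for p in probs])
    # Mixing ratio lambda ~ Beta(alpha, alpha), one per sample.
    lam = rng.beta(alpha, alpha, size=len(y))
    lam_x = lam.reshape(-1, *([1] * (x.ndim - 1)))
    lam_y = lam.reshape(-1, *([1] * (y.ndim - 1)))
    # Linear interpolation of features and labels, as in Mixup.
    x_mix = lam_x * x + (1 - lam_x) * x[partners]
    y_mix = lam_y * y + (1 - lam_y) * y[partners]
    return x_mix, y_mix
```

In this sketch, a very large bandwidth makes the partner choice nearly uniform, which approaches plain Mixup, while a small bandwidth restricts mixing to samples with nearly identical labels. A noisy label distorts both the partner choice and the interpolated label, which is the weakness the abstract points out.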

Paper Structure (41 sections, 3 equations, 5 figures, 8 tables, 3 algorithms)

This paper contains 41 sections, 3 equations, 5 figures, 8 tables, 3 algorithms.

Figures (5)

  • Figure 1: RC-Mixup tightly integrates C-Mixup with multi-round robust training techniques for a synergistic effect: C-Mixup improves robust training in identifying clean data, while robust training provides (intermediate) clean data for C-Mixup. Suppose the x-axis is the only feature, and the y-axis is the label. Also, there are two clean samples $a$ and $b$ and three noisy samples $c$, $d$, and $e$. In Step 1, suppose that cleaning removes $d$ and $e$ (the exact outcome depends on the robust training technique). In Step 2, we perform C-Mixup possibly with bandwidth tuning to generate mixed samples. Here we mix the sample pairs ($a$, $c$) and ($b$, $c$) to generate the mixed samples denoted as star shapes. Notice that C-Mixup selectively mixes samples that have closer labels, so in this example ($a$, $b$) are not mixed. In Step 3, the augmented samples can be used to train an improved regression model, which can then be used for better cleaning in the next round.
  • Figure 2: We evaluate C-Mixup on noisy data where we add label noise to the Spectrum dataset. A lower RMSE means better model performance. (a) As the noise ratio increases, both ERM and C-Mixup perform gradually worse, while their performance gap changes little. (b) In addition, the optimal bandwidth may actually increase, since mixing dilutes the negative impact of out-of-distribution data on model performance.
  • Figure 3: Robust training benefits C-Mixup and vice versa. (a) As robust training iteratively cleans the data, C-Mixup's model performance improves. (b) Using C-Mixup within robust training helps it remove noisy data more effectively than robust training without C-Mixup.
  • Figure 4: (a)-(d) RC-Mixup model training convergence results. (e)-(h) RC-Mixup dynamic bandwidth tuning results. The tuned bandwidth values vary slightly depending on the random seed, and we show the bandwidth averaged across five random seeds (dotted red lines). (i)-(l) RC-Mixup performance results while varying the validation set size.
  • Figure 5: RC-Mixup performance when varying $L$ and $N$ used for bandwidth tuning on the Spectrum dataset.
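
The three steps described in the Figure 1 caption (cleaning, mixing with bandwidth tuning, training) can be sketched as a loop. This is a toy illustration under several stated assumptions, not the authors' implementation: a least-squares linear model stands in for the regression model, a small-loss rule with a fixed `drop_frac` stands in for the robust training technique, the bandwidth is chosen from a fixed candidate list by validation RMSE, and the model is fit on cleaned plus mixed samples. All function and parameter names are hypothetical.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares linear model with a bias term (stand-in regression model)."""
    design = np.hstack([x, np.ones((len(x), 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, x):
    return np.hstack([x, np.ones((len(x), 1))]) @ coef

def rmse(coef, x, y):
    return float(np.sqrt(np.mean((predict(coef, x) - y) ** 2)))

def mix_close_labels(x, y, bandwidth, rng, alpha=2.0):
    """C-Mixup-style mixing: partners sampled with a Gaussian kernel on label distance."""
    weights = np.exp(-((y[:, None] - y[None, :]) ** 2) / (2.0 * bandwidth ** 2))
    probs = weights / weights.sum(axis=1, keepdims=True)
    partners = np.array([rng.choice(len(y), p=p) for p in probs])
    lam = rng.beta(alpha, alpha, size=len(y))
    x_mix = lam[:, None] * x + (1 - lam[:, None]) * x[partners]
    y_mix = lam * y + (1 - lam) * y[partners]
    return x_mix, y_mix

def rc_mixup_sketch(x, y, x_val, y_val, rounds=3, drop_frac=0.2,
                    bandwidths=(0.1, 0.5, 2.0), seed=0):
    rng = np.random.default_rng(seed)
    coef = fit_linear(x, y)  # initial model trained on the noisy data
    keep = np.ones(len(y), dtype=bool)
    for _ in range(rounds):
        # Step 1 (cleaning, hypothetical rule): keep the samples with the smallest loss.
        loss = (predict(coef, x) - y) ** 2
        keep = loss <= np.quantile(loss, 1.0 - drop_frac)
        x_clean, y_clean = x[keep], y[keep]
        best = None
        for bw in bandwidths:
            # Step 2 (mixing): generate mixed samples for this candidate bandwidth.
            x_mix, y_mix = mix_close_labels(x_clean, y_clean, bw, rng)
            # Step 3 (training): fit on cleaned plus mixed samples (an assumption).
            cand = fit_linear(np.vstack([x_clean, x_mix]),
                              np.concatenate([y_clean, y_mix]))
            score = rmse(cand, x_val, y_val)
            if best is None or score < best[0]:
                best = (score, cand)
        coef = best[1]  # the improved model drives cleaning in the next round
    return coef, keep
```

The point of the sketch is the data flow: each round's model decides which samples count as clean, only those samples are mixed, and the model trained on the result performs the next round of cleaning. The cleaning rule is deliberately isolated in one step, reflecting the caption's remark that the exact outcome depends on the robust training technique.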