Linearly Convergent Mixup Learning

Gakuto Obi; Ayato Saito; Yuto Sasaki; Tsuyoshi Kato

TL;DR

The paper addresses the difficulty of applying mixup data augmentation to learning in reproducing kernel Hilbert spaces (RKHS). It introduces two algorithms based on stochastic dual coordinate ascent (SDCA) that need no learning-rate hyperparameter and converge linearly; both the number of iterations to converge and the cost per iteration scale linearly with the dataset size. The paper formulates the primal mixup-augmented objective in RKHS, shows why a naïve dual formulation is difficult because of infimal convolutions, and then gives two scalable solutions: an approximation-based SDCA and a decomposition-based SDCA. Experiments on binary toxicity prediction show that mixup improves predictive performance across loss functions, with the approximation method converging fastest. The work advances kernel-method optimization under data augmentation and suggests applying these ideas to broader tasks and privacy-preserving settings.
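To make the "no learning rate" property concrete, here is a minimal sketch of plain SDCA, not either of the paper's mixup variants. It assumes a linear kernel (explicit feature vectors), the squared loss $(w^\top x_i - y_i)^2$, and an L2 regularizer $\frac{\lambda}{2}\|w\|^2$; the function name `sdca_squared_loss` and the toy data are hypothetical. Each step maximizes the dual exactly along one coordinate in closed form, so no step size is tuned.

```python
import random

def sdca_squared_loss(X, y, lam, n_steps, seed=0):
    """Plain SDCA for min_w (1/n) sum_i (w.x_i - y_i)^2 + (lam/2)||w||^2.

    Illustrative assumption: linear kernel and squared loss, chosen because
    the coordinate update has a closed form. Not the paper's algorithm.
    """
    n, d = len(X), len(X[0])
    rng = random.Random(seed)
    alpha = [0.0] * n          # dual variables, one per training example
    w = [0.0] * d              # kept equal to (1/(lam*n)) * sum_i alpha_i * x_i
    for _ in range(n_steps):
        i = rng.randrange(n)   # pick one dual coordinate at random
        xi = X[i]
        pred = sum(wk * xk for wk, xk in zip(w, xi))
        sq_norm = sum(xk * xk for xk in xi)
        # Exact maximizer of the dual along coordinate i (no learning rate).
        delta = (y[i] - pred - 0.5 * alpha[i]) / (0.5 + sq_norm / (lam * n))
        alpha[i] += delta
        w = [wk + delta * xk / (lam * n) for wk, xk in zip(w, xi)]
    return w, alpha

# Toy 1-D data with real-valued targets in [0, 1] (hypothetical values).
X = [[0.5], [1.0], [1.5], [2.0]]
y = [0.0, 0.25, 0.75, 1.0]
w, alpha = sdca_squared_loss(X, y, lam=1.0, n_steps=2000)
```

For this one-dimensional toy problem the primal minimizer is available in closed form, $w^\star = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda n / 2)$, which gives a simple check on the returned `w`.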

Abstract

Learning in a reproducing kernel Hilbert space (RKHS), as in the support vector machine, has been recognized as a promising technique. It remains highly effective and competitive in numerous prediction tasks, particularly when training data are scarce or computational resources are limited. These methods are especially valued for their ability to work with small datasets and for their interpretability. Mixup data augmentation, widely used in deep learning to address limited training data, has remained challenging to apply to learning in RKHS because it generates intermediate class labels. Although gradient descent methods handle these labels effectively, dual optimization approaches are typically not directly applicable. In this study, we present two novel algorithms that extend to a broader range of binary classification models. Unlike gradient-based approaches, our algorithms do not require hyperparameters such as learning rates, which simplifies their implementation and tuning. Both the number of iterations to converge and the computational cost per iteration scale linearly with the dataset size. The numerical experiments demonstrate that our algorithms converge to the optimal solution faster than gradient descent approaches, and that mixup data augmentation consistently improves predictive performance across various loss functions.
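The "intermediate class labels" mentioned above can be seen in a small sketch of standard mixup sampling. The assumptions here are not taken from the paper: labels are encoded as 0 and 1, the mixing weight is drawn from a symmetric Beta distribution as in the usual mixup recipe, and the function name `mixup_pairs` and the toy data are hypothetical.

```python
import random

def mixup_pairs(X, y, n_aug, alpha=1.0, seed=0):
    """Generate n_aug mixup samples from feature rows X and labels y in {0, 1}.

    Each new sample is a convex combination of two randomly chosen training
    points, and its label is the same convex combination of their labels.
    """
    rng = random.Random(seed)
    X_aug, y_aug = [], []
    for _ in range(n_aug):
        i = rng.randrange(len(X))
        j = rng.randrange(len(X))
        lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
        x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(X[i], X[j])]
        y_mix = lam * y[i] + (1.0 - lam) * y[j]  # may fall strictly between 0 and 1
        X_aug.append(x_mix)
        y_aug.append(y_mix)
    return X_aug, y_aug

# Toy data (hypothetical values): three points, binary labels.
X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
y = [0, 1, 1]
X_aug, y_aug = mixup_pairs(X, y, n_aug=5)
```

When the two mixed points carry different labels, `y_mix` is a fraction rather than a class, which is the case a standard SVM-style dual with hard labels does not cover directly.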

Paper Structure (15 sections, 5 theorems, 51 equations, 3 figures, 2 tables, 3 algorithms)

Introduction
Related Work
Primal problem
Naïve dual problem and its challenge
Approximation approach
Decomposition approach
Experiments
Prediction performance
Runtime
Conclusions
Smooth loss functions
Proof for Lemma \ref{lem:errp-if-geodecr}
Proof for Theorem \ref{thm:approx-beta}
Proof for Lemma \ref{lem:tilphi-smooth}
Proof for Theorem \ref{thm:decomp-beta}

Key Result

Lemma 1

Consider a randomized algorithm that computes $\bm{\alpha}^{(t)}\in\mathbb{R}^{n}$ from $\bm{\alpha}^{(t-1)}\in\mathbb{R}^{n}$. Suppose that there exists a constant $\beta$ such that $0 < \beta < 1$ and […]. Then, for any constant $\epsilon_{\text{P}}>0$, it holds that $\mathbb{E}\left[h_{\text{P}}^{(t)}\right]\le \epsilon_{\text{P}}$ for […].
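As a generic illustration of why a per-iteration contraction by a factor $1-\beta$ leads to an iteration count logarithmic in the target accuracy, consider the standard argument below. It is stated under an assumed condition and is not the lemma's exact hypothesis or bound; the symbols $g^{(t)}$ and $\epsilon$ are hypothetical. Suppose a nonnegative quantity $g^{(t)}$ satisfies $\mathbb{E}\left[g^{(t)}\right]\le(1-\beta)\,\mathbb{E}\left[g^{(t-1)}\right]$ for all $t\ge 1$. Then

$$
\mathbb{E}\left[g^{(t)}\right]\le(1-\beta)^{t}\,g^{(0)}\le e^{-\beta t}\,g^{(0)},
\qquad\text{so}\qquad
\mathbb{E}\left[g^{(t)}\right]\le\epsilon
\quad\text{whenever}\quad
t\ge\frac{1}{\beta}\log\frac{g^{(0)}}{\epsilon}.
$$

The second inequality uses $1-\beta\le e^{-\beta}$. A larger $\beta$ therefore means fewer iterations, and each extra digit of accuracy costs only a constant number of additional iterations.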

Figures (3)

Figure 1: $\textsc{MixupSDCA}_{\text{na\"{i}ve}}$.
Figure 2: Determine $\widetilde{F}_{t}$.
Figure 3: $\textsc{MixupSDCA}_{\text{approx}}$ for maximizing $D_{0}({\bm{\alpha}})$.

Theorems & Definitions (5)

Lemma 1
Lemma 2
Theorem 1
Lemma 3
Theorem 2
