Provable Benefit of Cutout and CutMix for Feature Learning

Junsoo Oh, Chulhee Yun

TL;DR

This paper studies two-layer neural networks trained using three distinct methods: vanilla training without augmentation, Cutout training, and CutMix training, and establishes that CutMix yields the highest test accuracy among the three.

Abstract

Patch-level data augmentation techniques such as Cutout and CutMix have demonstrated significant efficacy in enhancing the performance of vision tasks. However, a comprehensive theoretical understanding of these methods remains elusive. In this paper, we study two-layer neural networks trained using three distinct methods: vanilla training without augmentation, Cutout training, and CutMix training. Our analysis focuses on a feature-noise data model, which consists of several label-dependent features of varying rarity and label-independent noises of differing strengths. Our theorems demonstrate that Cutout training can learn low-frequency features that vanilla training cannot, while CutMix training can learn even rarer features that Cutout cannot capture. From this, we establish that CutMix yields the highest test accuracy among the three. Our novel analysis reveals that CutMix training makes the network learn all features and noise vectors "evenly" regardless of the rarity and strength, which provides an interesting insight into understanding patch-level augmentation.
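The two patch-level augmentations compared in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation. It assumes each data point is an array of shape `(P, d)` holding `P` patches, that Cutout zeroes a fixed number of randomly chosen patches, and that CutMix pastes a uniformly random subset of patches from a second data point and weights the two labels by the fraction of patches each contributes. The function names, the sampling scheme, and the label-weight convention are all hypothetical choices made for this example.

```python
import numpy as np


def cutout(x, num_masked, rng):
    """Zero out `num_masked` randomly chosen patches of x (shape [P, d])."""
    out = x.copy()
    masked = rng.choice(x.shape[0], size=num_masked, replace=False)
    out[masked] = 0.0
    return out, masked


def cutmix(x_a, y_a, x_b, y_b, rng):
    """Paste a random subset of patches from x_b into x_a and mix the labels.

    Returns the mixed data point, the label triple (lam, y_a, y_b) where
    lam is the weight on y_a, and the indices of the swapped patches.
    """
    num_patches = x_a.shape[0]
    # Assumption: the number of swapped patches is uniform on {0, ..., P}.
    num_swapped = int(rng.integers(0, num_patches + 1))
    swapped = rng.choice(num_patches, size=num_swapped, replace=False)
    out = x_a.copy()
    out[swapped] = x_b[swapped]
    lam = 1.0 - num_swapped / num_patches  # fraction of patches kept from x_a
    return out, (lam, y_a, y_b), swapped


# Toy usage: 4 patches of dimension 3, labels in {+1, -1}.
rng = np.random.default_rng(0)
x_cat = np.zeros((4, 3))
x_dog = np.ones((4, 3))
x_cut, masked_idx = cutout(x_dog, num_masked=2, rng=rng)
x_mix, (lam, y1, y2), swapped_idx = cutmix(x_cat, +1, x_dog, -1, rng=rng)
```

In a training loop, the CutMix loss on `x_mix` would typically be `lam * loss(y1, f(x_mix)) + (1 - lam) * loss(y2, f(x_mix))`, while vanilla training uses the unmodified data point and Cutout uses `x_cut` with the original label.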

Paper Structure

This paper contains 72 sections, 31 theorems, 328 equations, 6 figures, 1 table.

Key Result

Theorem 3.1

Let ${\bm{W}}^{(t)}$ be iterates of ERM. Then with probability at least $1- o \left( \frac{1}{\mathrm{poly}(d)} \right)$, there exists $T_\mathrm{ERM}$ such that any $T \in [T_\mathrm{ERM}, T^*]$ satisfies the following:

Figures (6)

  • Figure 1: Numerical results on our problem setting. We validate our findings on the trends of ERM, Cutout, and CutMix in learning the common feature (Left), the rare feature (Center), and the extremely rare feature (Right). The output of the common feature trained by CutMix shows non-monotone behavior.
  • Figure 2: Histogram of the dog prediction output minus the cat prediction output, evaluated on CutMix-augmented data points built from cat data and dog data with varying mixing ratio $\lambda$ ($\text{Dog}:\text{Cat} = \lambda : 1-\lambda$): (Left) $\lambda = 1$, (Center) $\lambda = 0.8$, (Right) $\lambda = 0.6$
  • Figure 3: Examples of rare data in CIFAR-10
  • Figure 4: Examples of extreme data in CIFAR-10
  • Figure 5: Multi-neuron with a smoothed leaky ReLU activation
  • ...and 1 more figure

Theorems & Definitions (62)

  • Definition 2.1: Feature Noise Patch Data
  • Definition 2.2: 2-Layer CNN
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Remark 4.1
  • Remark 4.2
  • Lemma B.2
  • Proof: Proof of Lemma \ref{lemma:initial}
  • Lemma B.3
  • ...and 52 more