Data augmentation instead of explicit regularization

Alex Hernández-García, Peter König

TL;DR

This paper challenges the necessity of explicit regularization techniques such as weight decay and dropout by contrasting them with data augmentation, an implicit regularizer. It defines explicit and implicit regularization, analyzes their theoretical effects via Rademacher complexity and generalization bounds, and presents an empirical study on ImageNet and CIFAR with three architectures (All-CNN, WRN, DenseNet). The results show that data augmentation alone matches or surpasses the performance obtained with explicit regularization, and that explicit methods can hurt performance when the amount of training data or the architecture changes and their hyperparameters are not re-tuned. The authors advocate prioritizing data augmentation and other inductive biases to improve generalization while reducing computational cost and environmental impact.

Abstract

Contrary to most machine learning models, modern deep artificial neural networks typically include multiple components that contribute to regularization. Despite the fact that some (explicit) regularization techniques, such as weight decay and dropout, require costly fine-tuning of sensitive hyperparameters, the interplay between them and other elements that provide implicit regularization is not well understood yet. Shedding light upon these interactions is key to efficiently using computational resources and may contribute to solving the puzzle of generalization in deep learning. Here, we first provide formal definitions of explicit and implicit regularization that help to understand essential differences between techniques. Second, we contrast data augmentation with weight decay and dropout. Our results show that visual object categorization models trained with data augmentation alone achieve the same performance or higher than models trained also with weight decay and dropout, as is common practice. We conclude that the contribution of weight decay and dropout to generalization is not only superfluous when sufficient implicit regularization is provided, but that such techniques can also dramatically deteriorate the performance if the hyperparameters are not carefully tuned for the architecture and data set. In contrast, data augmentation systematically provides large generalization gains and does not require hyperparameter re-tuning. In view of our results, we suggest optimizing neural networks without weight decay and dropout to save computational resources, hence carbon emissions, and focusing more on data augmentation and other inductive biases to improve performance and robustness.
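To make the contrast in the abstract concrete, here is a minimal NumPy sketch of the three techniques being compared. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of transformations (random horizontal flip and a small cyclic shift), and all default values are hypothetical. Data augmentation acts on the input images, weight decay adds a penalty on the weights to the loss, and dropout masks activations.

```python
import numpy as np


def augment(batch, rng, max_shift=2):
    """Data augmentation (assumed scheme): random horizontal flip plus a
    small cyclic translation, applied per image. batch has shape (N, H, W, C)."""
    out = np.empty_like(batch)
    for i, img in enumerate(batch):
        if rng.random() < 0.5:
            img = img[:, ::-1, :]  # flip along the width axis
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        # np.roll wraps pixels around the border; real pipelines often pad instead
        out[i] = np.roll(img, shift=(int(dy), int(dx)), axis=(0, 1))
    return out


def l2_penalty(weights, weight_decay):
    """Weight decay written as an L2 penalty term added to the loss."""
    return weight_decay * sum(float(np.sum(w ** 2)) for w in weights)


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and rescale
    the survivors so the expected activation is unchanged."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)


rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))   # toy batch, not real data
augmented = augment(images, rng)    # changes the inputs only
penalty = l2_penalty([np.ones((2, 2))], weight_decay=0.5)  # changes the loss
hidden = dropout(np.ones((4, 16)), rate=0.5, rng=rng)      # changes activations
```

Note that `weight_decay` and `rate` are exactly the kind of hyperparameters the abstract describes as sensitive, whereas the augmentation function has no direct effect on the model or the loss.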

Paper Structure

This paper contains 24 sections, 2 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Visual summary of the experimental setup. The figure represents the factors of variation in our experiments: data sets, architectures, amount of training data, data augmentation scheme and inclusion of explicit regularization. Comparisons are most relevant within the factors shown on the right, such as the performance of the models trained with and without explicit regularization.
  • Figure 2: Relative improvement of adding data augmentation and explicit regularization to the baseline models, $(\text{accuracy} - \text{baseline})/\text{accuracy} \times 100$. The baseline accuracy is shown on the left. The results suggest that data augmentation alone (in blue) can achieve even better performance than the models trained with both weight decay and dropout (in orange).
  • Figure 3: Bootstrap analysis to assess the difference in performance gain provided by training without and with weight decay and dropout, on the original architectures and using the full data sets. On the left of the figure we plot the bootstrap values (the differences), with the mean and median as a solid and dashed line, respectively. The main figure shows the distribution of the mean of the bootstrap samples, the standard error of the sample mean, the 95% confidence intervals and the $P$ value with respect to the null hypothesis ($H_0=0$).
  • Figure 4: Dynamics of the validation accuracy during training of All-CNN, WRN and DenseNet, trained on CIFAR-10 with heavier data augmentation, contrasting the models trained with explicit regularization (orange lines) and the models trained with only data augmentation (in blue). The regularized models heavily rely on the learning rate decay to obtain the boost of performance, while the models trained without explicit regularization quickly approach the final performance.
  • Figure 5: Fraction of the baseline performance when the amount of available training data is reduced, $\text{accuracy}/\text{baseline} \times 100$. The models trained with explicit regularization present a significant drop in performance as compared to the models trained with only data augmentation. The differences become larger as the amount of training data decreases.
  • ...and 3 more figures
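
The two quantities defined in the captions of Figures 2 and 5, and a bootstrap summary in the spirit of Figure 3, can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the percentile method for the confidence interval is an assumption (the captions do not say which bootstrap variant was used), and the numbers in the usage lines are invented for demonstration, not results from the paper.

```python
import random
import statistics


def relative_improvement(accuracy, baseline):
    """Figure 2 caption: (accuracy - baseline) / accuracy * 100."""
    return (accuracy - baseline) / accuracy * 100


def fraction_of_baseline(accuracy, baseline):
    """Figure 5 caption: accuracy / baseline * 100."""
    return accuracy / baseline * 100


def bootstrap_mean_ci(differences, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap of the mean of `differences` (assumed method).

    Returns (mean of the bootstrap means, lower bound, upper bound) for a
    (1 - alpha) confidence interval.
    """
    rng = random.Random(seed)
    n = len(differences)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(differences, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(means), lower, upper


# Illustrative inputs only (not taken from the paper)
print(relative_improvement(accuracy=80.0, baseline=60.0))
print(fraction_of_baseline(accuracy=45.0, baseline=90.0))
print(bootstrap_mean_ci([0.4, -0.1, 0.8, 0.3, 0.6], n_resamples=2000))
```

If the resulting interval excludes zero, the difference between training without and with explicit regularization would be considered unlikely under the null hypothesis of no difference, which is the reading Figure 3 supports.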