Enlightenment Period Improving DNN Performance

Tiantian Liu, Meng Wan, Jue Wang, Ningming Nie

TL;DR

The paper identifies an early, brief Enlightenment Period at the start of deep neural network training, during which representations transition from disordered to ordered. It builds a phase-transition grounded model to analyze Mixup-triggered gradient interference and Activation Revival, showing that interference weakens with larger sample or parameter counts. Based on these insights, it introduces three data-distribution strategies—Mixup Pause, Alpha Boost, and High-Loss Removal—that yield statistically significant performance gains across ViT and ResNet architectures on CIFAR and ImageNet, and even extend to time-series and language-model contexts. The work provides a practical, phase-aware framework for improving training by capitalizing on early dynamics, with open-source code to ensure reproducibility and broad applicability across domains.

Abstract

The start of deep neural network training is characterized by a brief yet critical phase that lasts from the beginning of training until the accuracy reaches approximately 50%. During this phase, disordered representations rapidly transition toward ordered structure, and we term this phase the Enlightenment Period. Through theoretical modeling based on phase transition theory and experimental validation, we reveal that applying Mixup data augmentation during this phase has a dual effect: it introduces a Gradient Interference Effect that hinders performance, while also providing a beneficial Activation Revival Effect that restores gradient updates for saturated neurons. We further demonstrate that this negative interference diminishes as the sample set size or the model parameter size increases, thereby shifting the balance between these two effects. Based on these findings, we propose three strategies that improve performance solely by adjusting the training data distribution within this brief period: the Mixup Pause Strategy for small-scale scenarios, the Alpha Boost Strategy for large-scale scenarios with underfitting, and the High-Loss Removal Strategy for tasks where Mixup is inapplicable (e.g., time series and large language models). Extensive experiments show that these strategies achieve superior performance across diverse architectures such as ViT and ResNet on datasets including CIFAR and ImageNet-1K. Ultimately, this work offers a novel perspective on enhancing model performance by strategically capitalizing on the dynamics of the brief and crucial early stages of training. Code is available at https://anonymous.4open.science/r/code-A5F1/.
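
To make the Mixup Pause idea concrete, here is a minimal sketch under stated assumptions. It reads "pause" as skipping Mixup while training accuracy is still below the roughly 50% mark that ends the Enlightenment Period. The names `prepare_batch`, `mixup_batch`, and `ENLIGHTENMENT_ACC_THRESHOLD` are hypothetical and do not come from the authors' code, and the exact gating rule in the paper may differ.

```python
import random

# Assumption: the Enlightenment Period ends at roughly 50% accuracy (per the abstract).
ENLIGHTENMENT_ACC_THRESHOLD = 0.5


def mixup_batch(xs, ys, alpha, rng):
    """Standard input Mixup: blend each sample and label with a random partner."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight drawn from Beta(alpha, alpha)
    partners = list(range(len(xs)))
    rng.shuffle(partners)

    def blend(a, b):
        # Convex combination of two equal-length vectors.
        return [lam * p + (1.0 - lam) * q for p, q in zip(a, b)]

    x_mix = [blend(xs[i], xs[j]) for i, j in enumerate(partners)]
    y_mix = [blend(ys[i], ys[j]) for i, j in enumerate(partners)]
    return x_mix, y_mix


def prepare_batch(xs, ys, train_acc, alpha=1.0, rng=None):
    """Hypothetical Mixup Pause gate.

    xs: list of feature vectors; ys: list of one-hot label vectors.
    While train_acc is below the threshold, the batch is returned unmixed.
    """
    if rng is None:
        rng = random.Random()
    if train_acc < ENLIGHTENMENT_ACC_THRESHOLD:
        return xs, ys  # Mixup paused during the Enlightenment Period
    return mixup_batch(xs, ys, alpha, rng)
```

In a training loop, `train_acc` would be a running accuracy estimate updated each epoch. Whether the gate should use training or validation accuracy is left open here.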

Paper Structure

This paper contains 31 sections, 25 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: 2D Embedding Visualization of three selected classes from CIFAR-10 using ResNet18, comparing two training strategies: (1) vanilla training and (2) training with Input Mixup.
  • Figure 2: Statistics of Cosine Similarity Between Vanilla and Mixup Gradient Updates. Avg. grad cos sim: average cosine similarity of gradients across all sample pairs; Prop. ($\bm{\cos<0.5}$): proportion of sample pairs with a gradient angle exceeding $60^\circ$; Prop. ($\bm{\cos<0}$): proportion of sample pairs with a gradient angle exceeding $90^\circ$.
  • Figure 3: Experimental results for verifying gradient interference and its diminishing effect
  • Figure 4: Zero-Value Activations per Sample (ResNet18 on CIFAR-100: Vanilla vs Mixup)
  • Figure 5: Accuracy Improvement of Models Trained with the Mixup Pause Strategy Over the Baseline on CIFAR-100
  • ...and 3 more figures
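
The three statistics named in the Figure 2 caption can be sketched as follows. This is an illustrative computation only: it assumes each vanilla gradient is paired with one Mixup gradient given as a flat list of floats, and the function names `cosine` and `gradient_alignment_stats` are hypothetical. The thresholds follow from $\cos 60^\circ = 0.5$ and $\cos 90^\circ = 0$.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gradient_alignment_stats(vanilla_grads, mixup_grads):
    """Summarize alignment between paired gradient vectors.

    Returns the average cosine similarity, the proportion of pairs with
    cos < 0.5 (angle above 60 degrees), and the proportion with cos < 0
    (angle above 90 degrees).
    """
    sims = [cosine(u, v) for u, v in zip(vanilla_grads, mixup_grads)]
    n = len(sims)
    return {
        "avg_cos": sum(sims) / n,
        "prop_cos_lt_0.5": sum(s < 0.5 for s in sims) / n,
        "prop_cos_lt_0": sum(s < 0.0 for s in sims) / n,
    }
```

A large share of pairs with cosine below 0 would mean that the Mixup update often points away from the vanilla update, which is the kind of conflict the Gradient Interference Effect describes.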