Enlightenment Period Improving DNN Performance
Tiantian Liu, Meng Wan, Jue Wang, Ningming Nie
TL;DR
The paper identifies a brief Enlightenment Period at the start of deep neural network training, during which representations transition from disordered to ordered. It builds a model grounded in phase-transition theory to analyze the gradient interference and Activation Revival triggered by Mixup, and shows that the interference weakens as the sample count or parameter count grows. Based on these insights, it introduces three data-distribution strategies (Mixup Pause, Alpha Boost, and High-Loss Removal) that yield statistically significant performance gains for ViT and ResNet architectures on CIFAR and ImageNet, and that extend to time-series and language-model settings. The work offers a practical, phase-aware framework for improving training by exploiting early training dynamics, with open-source code for reproducibility.
Abstract
The start of deep neural network training is characterized by a brief yet critical phase that lasts from the beginning of training until the accuracy reaches approximately 50%. During this phase, disordered representations rapidly transition toward ordered structure, and we term this phase the Enlightenment Period. Through theoretical modeling based on phase transition theory and experimental validation, we reveal that applying Mixup data augmentation during this phase has a dual effect: it introduces a Gradient Interference Effect that hinders performance, while also providing a beneficial Activation Revival Effect that restores gradient updates for saturated neurons. We further demonstrate that this negative interference diminishes as the sample set size or the model parameter size increases, thereby shifting the balance between these two effects. Based on these findings, we propose three strategies that improve performance solely by adjusting the training data distribution within this brief period: the Mixup Pause Strategy for small-scale scenarios, the Alpha Boost Strategy for large-scale scenarios with underfitting, and the High-Loss Removal Strategy for tasks where Mixup is inapplicable (e.g., time series and large language models). Extensive experiments show that these strategies achieve superior performance across diverse architectures such as ViT and ResNet on datasets including CIFAR and ImageNet-1K. Ultimately, this work offers a novel perspective on enhancing model performance by strategically capitalizing on the dynamics of the brief and crucial early stages of training. Code is available at https://anonymous.4open.science/r/code-A5F1/.
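
The three strategies can be pictured with a minimal Python sketch. This is one reading of the abstract and not the authors' released code. The 0.5 accuracy threshold comes from the abstract; everything else is a hypothetical name or placeholder value chosen for illustration: `mixup_alpha`, `mixup_pair`, `keep_low_loss`, `base_alpha`, `boost_factor`, and `drop_frac`. In particular, it assumes that Alpha Boost means raising the Mixup Beta parameter during the Enlightenment Period.

```python
import random

# From the abstract: the Enlightenment Period lasts until accuracy reaches ~50%.
ENLIGHTENMENT_ACC = 0.5


def mixup_alpha(train_acc, base_alpha=0.2, strategy="pause", boost_factor=2.0):
    """Return the Beta(alpha, alpha) parameter to use now; 0.0 means Mixup is off.

    base_alpha and boost_factor are illustrative values, not taken from the paper.
    """
    if train_acc >= ENLIGHTENMENT_ACC:
        return base_alpha              # period is over: ordinary Mixup
    if strategy == "pause":
        return 0.0                     # Mixup Pause: no Mixup during the period
    if strategy == "boost":
        return base_alpha * boost_factor  # Alpha Boost (assumed: larger alpha)
    raise ValueError(f"unknown strategy: {strategy}")


def mixup_pair(x_a, x_b, y_a, y_b, alpha, rng):
    """Standard Mixup of two samples and their one-hot labels (lists of floats)."""
    if alpha <= 0.0:
        return list(x_a), list(y_a)    # Mixup disabled: return the first sample
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def keep_low_loss(losses, drop_frac=0.1):
    """High-Loss Removal (assumed form): indices kept after dropping the
    drop_frac fraction of samples with the highest loss."""
    n_keep = len(losses) - int(len(losses) * drop_frac)
    order = sorted(range(len(losses)), key=lambda i: losses[i])  # ascending loss
    return sorted(order[:n_keep])


if __name__ == "__main__":
    rng = random.Random(0)
    for acc in (0.10, 0.45, 0.60):
        alpha = mixup_alpha(acc, strategy="pause")
        x, y = mixup_pair([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], alpha, rng)
        print(f"acc={acc:.2f} alpha={alpha} mixed_label={y}")
```

In a real training loop, `train_acc` would be the running training accuracy, and `keep_low_loss` would be applied only while that accuracy is still below the threshold, since the abstract restricts all three strategies to this early period.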
