Tune without Validation: Searching for Learning Rate and Weight Decay on Training Sets
Lorenzo Brigato, Stavroula Mougiakakou
TL;DR
Tune without Validation (Twin) addresses the challenge of tuning the learning rate (LR) and weight decay (WD) without a validation set by performing a grid search over the LR-WD space on the training data, guided by a phase-diagram view of learning. It uses an early-/non-early-stopping scheduler to manage trials, logs training loss and parameter norms, and applies Quickshift-based region segmentation to identify a low-loss, low-norm region likely to generalize, selecting the final configuration as the one with the smallest norm within that region. Across 34 dataset-model configurations spanning 20 datasets, Twin achieves a mean absolute error (MAE) of about $1.3\%$ relative to an Oracle pipeline, closely matching Oracle performance in IID and several OOD settings while outperforming validation-based baselines when validation data is scarce or noisy. The approach is demonstrated on small datasets, medical imaging, and natural images, with extensive ablations showing robustness to grid density, segmentation parameters, and optimizer choices. By avoiding validation sets, it offers a scalable alternative that can reduce data collection costs. Overall, Twin provides a practical, data-efficient method for tuning hyperparameters in image classification, with potential for extension to other domains and regularization schemes.
Abstract
We introduce Tune without Validation (Twin), a pipeline for tuning learning rate and weight decay without validation sets. We leverage a recent theoretical framework concerning learning phases in hypothesis space to devise a heuristic that predicts which hyper-parameter (HP) combinations yield better generalization. Twin performs a grid search of trials according to an early-/non-early-stopping scheduler and then segments the region that provides the best results in terms of training loss. Among the trials in this region, the weight norm correlates strongly with generalization. To assess the effectiveness of Twin, we run extensive experiments on 20 image classification datasets and train several families of deep networks, including convolutional, transformer, and feed-forward models. We demonstrate proper HP selection when training from scratch and fine-tuning, emphasizing small-sample scenarios.
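
The selection rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_config`, the `loss_margin` parameter, and all logged values are hypothetical, and a fixed margin above the best training loss stands in for the Quickshift-based segmentation of the low-loss region.

```python
from typing import Dict, Tuple

Trial = Tuple[float, float]  # (learning_rate, weight_decay)


def select_config(
    train_loss: Dict[Trial, float],
    weight_norm: Dict[Trial, float],
    loss_margin: float = 0.1,
) -> Trial:
    """Return the lowest-norm trial among those with near-minimal training loss.

    Assumption: a fixed margin above the best loss replaces the
    Quickshift-based region segmentation used by Twin.
    """
    best_loss = min(train_loss.values())
    # Keep only trials that fit the training set almost as well as the best one.
    region = [t for t, loss in train_loss.items() if loss <= best_loss + loss_margin]
    # Within that region, prefer the configuration with the smallest weight norm.
    return min(region, key=lambda t: weight_norm[t])


# Hypothetical logs from a 3x3 LR-WD grid (values invented for illustration).
train_loss = {
    (1e-3, 1e-5): 0.90, (1e-3, 1e-4): 0.95, (1e-3, 1e-3): 1.20,
    (1e-2, 1e-5): 0.02, (1e-2, 1e-4): 0.03, (1e-2, 1e-3): 0.05,
    (1e-1, 1e-5): 0.04, (1e-1, 1e-4): 0.06, (1e-1, 1e-3): 2.30,
}
weight_norm = {
    (1e-3, 1e-5): 40.0, (1e-3, 1e-4): 39.0, (1e-3, 1e-3): 35.0,
    (1e-2, 1e-5): 55.0, (1e-2, 1e-4): 38.0, (1e-2, 1e-3): 21.0,
    (1e-1, 1e-5): 60.0, (1e-1, 1e-4): 30.0, (1e-1, 1e-3): 5.0,
}

selected = select_config(train_loss, weight_norm)
print(selected)
```

In this toy grid the trial with the smallest norm overall, `(1e-1, 1e-3)`, is excluded because its training loss is high, which is why the rule restricts the norm comparison to the low-loss region first.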
