Tune without Validation: Searching for Learning Rate and Weight Decay on Training Sets

Lorenzo Brigato, Stavroula Mougiakakou

TL;DR

Tune without Validation (Twin) addresses the challenge of tuning the learning rate and weight decay without a validation set by performing a grid search over the LR-WD space on the training data, guided by a phase-diagram view of learning. It uses an early-/non-early-stopping scheduler to manage trials, logs training loss and parameter norms, and applies Quickshift-based region segmentation to identify a low-loss, low-norm region likely to generalize, selecting the final configuration as the one with the smallest norm within that region. Across 34 dataset-model configurations spanning 20 datasets, Twin achieves an MAE of about $1.3\%$ relative to an Oracle pipeline, closely matching Oracle performance in IID and several OOD settings while outperforming validation-based baselines when validation data is scarce or noisy. The approach is demonstrated on small datasets, medical imaging, and natural images, with extensive ablations showing robustness to grid density, segmentation parameters, and optimizer choices, and it offers a scalable alternative that can reduce data collection costs by avoiding validation sets. Overall, Twin provides a practical, data-efficient method for tuning hyperparameters in image classification and has clear potential for extension to other domains and regularization schemes.

Abstract

We introduce Tune without Validation (Twin), a pipeline for tuning learning rate and weight decay without validation sets. We leverage a recent theoretical framework concerning learning phases in hypothesis space to devise a heuristic that predicts which hyper-parameter (HP) combinations yield better generalization. Twin performs a grid search of trials according to an early-/non-early-stopping scheduler and then segments the region that provides the best results in terms of training loss. Among these trials, the weight norm correlates strongly with generalization. To assess the effectiveness of Twin, we run extensive experiments on 20 image classification datasets and train several families of deep networks, including convolutional, transformer, and feed-forward models. We demonstrate proper HP selection when training from scratch and fine-tuning, emphasizing small-sample scenarios.
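
The selection step described above (keep the trials that fit the training set, then pick the one with the lowest weight norm) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_config`, the `loss_tol` parameter, and all grid values are hypothetical, and a plain loss threshold stands in for the Quickshift segmentation that Twin actually uses.

```python
def select_config(lrs, wds, train_loss, param_norm, loss_tol=0.1):
    """Pick (lr, wd) with the smallest parameter norm among low-loss trials.

    train_loss[i][j] and param_norm[i][j] belong to the trial trained with
    lrs[i] and wds[j]. loss_tol is a hypothetical threshold that replaces
    the Quickshift-based region segmentation described in the text.
    """
    best_loss = min(min(row) for row in train_loss)
    # "Fitting region": trials whose training loss is within loss_tol of the best.
    region = [
        (i, j)
        for i in range(len(lrs))
        for j in range(len(wds))
        if train_loss[i][j] <= best_loss + loss_tol
    ]
    # Within that region, prefer the network with the lowest parameter norm.
    i, j = min(region, key=lambda ij: param_norm[ij[0]][ij[1]])
    return lrs[i], wds[j]


# Toy 3x3 grid with made-up numbers, for illustration only.
lrs = [1e-3, 1e-2, 1e-1]
wds = [1e-4, 1e-3, 1e-2]
train_loss = [
    [0.02, 0.03, 0.90],
    [0.01, 0.05, 1.20],
    [0.04, 0.60, 2.30],
]
param_norm = [
    [50.0, 30.0, 8.0],
    [45.0, 22.0, 6.0],
    [40.0, 15.0, 3.0],
]
chosen = select_config(lrs, wds, train_loss, param_norm)
print(chosen)
```

In this toy grid the heavily regularized trials have the smallest norms but fail to fit the training data, so they fall outside the low-loss region and are never candidates; the norm criterion only ranks trials that already fit.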

Paper Structure (41 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview. While traditional pipelines need a validation set to tune learning rate and weight decay, Twin performs the search directly on the training set, simplifying the process or saving additional data-collection costs.
  • Figure 2: Twin overview. Twin employs a gradient-based optimizer and a trial scheduler to perform a grid search across the LR-WD space. Twin logs train-loss and parameter-norm matrices to identify the network with the lowest norm within the fitting region. The parameter norm within this region is a good predictor of generalization (right plot). In this figure, we show as an example a WRN-16-10 trained on ciFAIR-10.
  • Figure 3: Overview of quantitative results. Twin scores an overall 1.3% MAE against the Oracle pipeline across 34 different dataset-model configurations when using a FIFO scheduler. Twin closely matches the Oracle in IID and OOD scenarios, while SelTS fails to correctly predict HPs that generalize in OOD cases.
  • Figure 4: Qualitative results. Visualization of the various steps of Twin in the LR-WD space (first four rows) and the relationship between the selected parameter norms and test loss (bottom row). The dashed green line represents the lowest achievable test loss.
  • Figure 5: Transfer learning. (Left) Normalized balanced accuracy of the Oracle with ImageNet pre-trained (top) or from-scratch RN50 (bottom). Feature overlap makes the best generalization appear with lower regularization, and Twin (with EN-B0, RN50, RNX101) plus early stopping identifies this region by scoring a low MAE (right).
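
The MAE figures quoted in the captions can be read as follows, assuming (this page does not spell it out) that the compared quantity is the test performance of the configuration each pipeline selects. The symbols $a_i^{\text{Twin}}$, $a_i^{\text{Oracle}}$, and $N$ are introduced here for illustration only:

$$
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| a_i^{\text{Twin}} - a_i^{\text{Oracle}} \right|
$$

Here $i$ indexes the dataset-model configurations, with $N = 34$ for the overall result in Figure 3, so an MAE of $1.3\%$ would mean that Twin's selected configuration lands, on average, about 1.3 percentage points away from the Oracle's.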