Improving Hyperparameter Optimization with Checkpointed Model Weights

Nikhil Mehta; Jonathan Lorraine; Steve Masson; Ramanathan Arunachalam; Zaid Pervaiz Bhat; James Lucas; Arun George Zachariah

TL;DR

Forecasting Model Search (FMS) addresses the high expense of hyperparameter optimization by conditioning a Gaussian process surrogate on both hyperparameters and logged checkpoint weights via a permutation-invariant graph metanetwork (PIGMN). Built atop Dynamic Multifidelity HPO (DyHPO), FMS enhances prediction accuracy by encoding architecture- and training-dynamics information from weight checkpoints $\boldsymbol{W}$ into the surrogate, guiding budgeted evaluations for model selection from hubs and subsequent fine-tuning. Empirical results across multiple model hubs and datasets show that FMS-GMN achieves higher ranking quality (Kendall's $\tau$) and lower regret across compute budgets, with demonstrated transfer to unseen architectures and datasets. The approach is implemented with open-source code, enabling broader adoption and future extension to scalable surrogates and richer metadata integration.
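Ranking quality via Kendall's $\tau$ can be illustrated with a small sketch. The function below computes the tau-a variant (concordant minus discordant pairs, divided by the total number of pairs); the paper does not state which variant or library it uses, so this choice, the function name `kendall_tau`, and the example numbers are all assumptions made for illustration.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau-a between two equally long score lists.

    Assumption: tied pairs count as neither concordant nor discordant.
    """
    if len(predicted) != len(actual):
        raise ValueError("score lists must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two configurations to rank")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both lists order the pair the same way.
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up numbers: surrogate predictions vs. observed accuracy for 4 configs.
surrogate_scores = [0.71, 0.64, 0.90, 0.55]
observed_accuracy = [0.66, 0.68, 0.88, 0.50]
tau = kendall_tau(surrogate_scores, observed_accuracy)
```

Here the surrogate misorders only the first two configurations, so five of the six pairs are concordant and one is discordant. A value of 1 would mean the surrogate ranks every pair of configurations in the same order as their observed performance.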

Abstract

When training deep learning models, the performance depends largely on the selected hyperparameters. However, hyperparameter optimization (HPO) is often one of the most expensive parts of model design. Classical HPO methods treat this as a black-box optimization problem. Gray-box HPO methods, which incorporate more information about the setup, have emerged as a promising direction for more efficient optimization; for example, intermediate loss evaluations can be used to terminate poor selections early. In this work, we propose an HPO method for neural networks that uses logged checkpoints of the trained weights to guide future hyperparameter selections. Our method, Forecasting Model Search (FMS), embeds weights into a Gaussian process deep kernel surrogate model, using a permutation-invariant graph metanetwork to be data-efficient with the logged network weights. To facilitate reproducibility and further research, we open-source our code at https://github.com/NVlabs/forecasting-model-search.
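The idea of feeding checkpoint weights into a Gaussian process surrogate can be sketched in a few lines of NumPy. This is not the authors' implementation: the mean/max pooling in `weight_features` is a hand-written stand-in for a learned PIGMN, the embedding is fixed rather than trained as in a true deep kernel, and every name and number below is hypothetical.

```python
import numpy as np


def weight_features(W1, b1, W2):
    """Permutation-invariant summary of a one-hidden-layer MLP.

    Stand-in for a PIGMN (assumption): each hidden neuron is described by
    its incoming weights, bias, and outgoing weights, and pooling over
    neurons removes any dependence on neuron order.
    Shapes: W1 (hidden, in), b1 (hidden,), W2 (out, hidden).
    """
    per_neuron = np.concatenate([W1, b1[:, None], W2.T], axis=1)
    return np.concatenate([per_neuron.mean(axis=0), per_neuron.max(axis=0)])


def embed(hparams, W1, b1, W2):
    """Joint input to the kernel: hyperparameters plus weight features."""
    return np.concatenate([hparams, weight_features(W1, b1, W2)])


def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between rows of A and rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior_mean(Z_train, y_train, Z_test, noise=1e-3):
    """Zero-mean GP regression posterior mean at the test embeddings."""
    K = rbf_kernel(Z_train, Z_train) + noise * np.eye(len(Z_train))
    return rbf_kernel(Z_test, Z_train) @ np.linalg.solve(K, y_train)


rng = np.random.default_rng(0)
# Hypothetical (learning rate, weight decay) configurations.
configs = [np.array(c) for c in [(0.1, 0.0), (0.01, 0.1), (0.001, 0.01)]]
# Random arrays standing in for logged checkpoints.
checkpoints = [
    (rng.normal(size=(8, 4)), rng.normal(size=8), rng.normal(size=(2, 8)))
    for _ in configs
]
Z = np.stack([embed(h, *ckpt) for h, ckpt in zip(configs, checkpoints)])
y = np.array([0.62, 0.81, 0.74])  # made-up validation accuracies
pred = gp_posterior_mean(Z, y, Z[:1])

# Reordering hidden neurons leaves the weight features unchanged.
perm = rng.permutation(8)
W1, b1, W2 = checkpoints[0]
invariant = np.allclose(
    weight_features(W1, b1, W2),
    weight_features(W1[perm], b1[perm], W2[:, perm]),
)
```

The final check shows why permutation invariance helps data efficiency: two checkpoints that differ only in the ordering of hidden neurons compute the same function, and the surrogate sees them as the same input.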

Paper Structure (25 sections, 9 equations, 4 figures, 2 tables, 2 algorithms)

Introduction
Background
Dynamic Multifidelity Hyperparameter Optimization (DyHPO)
Algorithm Overview.
Our Method: Forecasting Model Search (FMS)
Permutation-Invariant Graph Metanetworks (PIGMNs)
Combining DyHPO with PIGMNs for FMS
Experiments
FMS for Fine-tuning on a Dataset
Generalization Performance of the Surrogate Model
Related Work
Hyperparameter Optimization (HPO)
Learning Features from Weight Spaces
Limitations and Future Directions
Conclusion
...and 10 more sections

Figures (4)

Figure 1: We show an overview of our method, Forecasting Model Search (FMS), which builds on DyHPO's multifidelity method from Algorithm \ref{alg:dyhpo}. Novel components of FMS are highlighted in blue and further detailed in Algorithm \ref{alg:fms}. We include DyHPO's features from the hyperparameter configuration, budget, and learning curve \cite{wistuba2023supervising}. Notably, we also featurize the model's checkpointed weights $\mathbf{W}$ with a permutation-invariant graph metanetwork (PIGMN) as in Section \ref{pigms} for input to a deep kernel GP (see Equations \ref{deepkernelgp} and \ref{eq:weight_kernel}). This provides the HPO with a rich, often pre-existing, source of information, which implicitly includes the architecture, dataset, loss, and optimization process. FMS shows improved predictions about hyperparameter performance across compute budgets (see Table \ref{ktau-values}), improved quality of the final selected configuration across compute budgets (see Figure \ref{fig:regret_over_time}), and a potential to generalize beyond what was seen in training (see Figure \ref{fig:fms-transfer}). Specific design choices for this surrogate model are detailed in Appendix Section \ref{meta-hyperparams}.
Figure 2: Each plot shows the regret against the compute budget for one hub, with each hyperparameter optimization (HPO) method drawn in its own color. The regret values reflect the difference between the actual performance and the best possible performance over time. Lower regret indicates better performance. Our method, FMS-GMN in blue, consistently shows lower regret than the strongest baseline DyHPO in red. This persists over most compute budgets across all hubs, demonstrating that our method is effective for HPO. FMS-NFN in cyan does not support diverse architectures, so it only runs on the Simple CNN Hub. Figure \ref{fig:fms-transfer} further investigates the generalization of our FMS-GMN method, while Appendix Figure \ref{fig:regret_over_time_detailed} shows ablations over our design choices.
Figure 3: We evaluate the ability of our method to generalize to new datasets and architectures. FMS-GMN with generalization shown in blue means the model was trained on multiple datasets. FMS-GMN without generalization shown in red was only trained on the current dataset. The results show that our model can effectively generalize knowledge between different tasks because the generalization setup's regret is consistently lower than the non-generalization setup, showing it converges faster to a potentially higher-quality solution by leveraging the additional datasets.
Figure 4: Each plot shows the regret against the compute budget for one hub, with each hyperparameter optimization (HPO) method drawn in its own color. The regret values reflect the difference between the actual performance and the best possible performance over time. Lower regret indicates better performance. Our method, FMS-GMN, consistently shows lower regret over time across all hubs, demonstrating its effectiveness in HPO. The compute budget is measured in epochs (a full pass through the dataset), standardizing the compute effort across different tasks. FMS-NFN does not support diverse architectures, so it only runs on the Simple CNN Hub.
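The regret curves in these captions can be read with a small worked example. The sketch below assumes regret at each budget step is the best possible performance minus the best performance found so far (the incumbent), with one evaluation per unit of budget; the captions do not spell out the exact bookkeeping, so `regret_curve` and the scores are illustrative only.

```python
def regret_curve(observed_scores, best_possible):
    """Regret after each evaluation: best possible minus best found so far.

    Assumption: higher scores are better, and each entry of
    observed_scores consumes one unit of compute budget.
    """
    best_so_far = float("-inf")
    curve = []
    for score in observed_scores:
        best_so_far = max(best_so_far, score)
        curve.append(best_possible - best_so_far)
    return curve


# Made-up validation accuracies from four successive evaluations.
curve = regret_curve([0.60, 0.72, 0.69, 0.80], best_possible=0.85)
```

Under this definition the curve never increases: the third evaluation (0.69) is worse than the incumbent (0.72), so regret stays flat until a better configuration is found.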
