Frozen Layers: Memory-efficient Many-fidelity Hyperparameter Optimization
Timur Carstensen, Neeratyoy Mallik, Frank Hutter, Martin Rapp
TL;DR
This work introduces layer freezing as a memory-efficient fidelity for multi-fidelity hyperparameter optimization (MF-HPO) in deep learning. By freezing the first $n{-}z$ layers of an $n$-layer network and training only the remaining $z$ layers, the method achieves substantial compute and memory savings (often $\geq 2\times$) while preserving strong rank correlations with full training across Transformer and ResNet architectures, enabling effective MF-HPO on budget-constrained hardware. The authors formalize two fidelity properties (cost monotonicity and mutual information monotonicity) and demonstrate that the layer-freezing fidelity can be combined with traditional fidelities (e.g., data tokens) to improve search efficiency via joint fidelity landscapes and successive-halving (SH) style schedules. The approach extends MF-HPO to memory-limited settings and heterogeneous hardware, with practical implications for the cost, accessibility, and energy efficiency of large-scale DL tuning.
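The freezing rule described above can be illustrated with a minimal, framework-free sketch. The `Layer` class, the helper names, and the equal-sized 12-block model are hypothetical and chosen only for illustration. The trainable-parameter fraction is used here as a rough proxy for savings: frozen parameters need no gradients or optimizer state, and backpropagation can stop at the first trainable layer. It is not the cost model used by the authors.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for one network block."""
    name: str
    num_params: int
    trainable: bool = True


def freeze_first_layers(layers, z):
    """Freeze the first n - z layers in place; keep the last z trainable."""
    n = len(layers)
    if not 0 < z <= n:
        raise ValueError(f"z must be in 1..{n}, got {z}")
    for i, layer in enumerate(layers):
        layer.trainable = i >= n - z
    return layers


def trainable_fraction(layers):
    """Share of parameters that still receive gradient updates."""
    total = sum(layer.num_params for layer in layers)
    trainable = sum(layer.num_params for layer in layers if layer.trainable)
    return trainable / total


# Assumed toy model: 12 equally sized blocks, train only the last 3.
layers = [Layer(f"block_{i}", 1_000_000) for i in range(12)]
freeze_first_layers(layers, z=3)
print(trainable_fraction(layers))  # 0.25
```

Setting `z = n` recovers full training, so `z` behaves like any other fidelity knob that a scheduler can raise as configurations are promoted.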
Abstract
As model sizes grow, finding efficient and cost-effective hyperparameter optimization (HPO) methods becomes increasingly crucial for deep learning pipelines. While multi-fidelity HPO (MF-HPO) trades the computational cost of DL training against lower-fidelity performance estimates, existing fidelity sources often fail under tight compute and memory constraints. We propose a novel fidelity source: the number of layers that are trained or frozen during training. For deep networks, this approach offers significant compute and memory savings while preserving, at low fidelities, the rank correlations between hyperparameter configurations observed under full model training. We demonstrate this in an empirical evaluation across ResNets and Transformers. We additionally analyze the utility of frozen layers as a fidelity in two settings: when GPU resources themselves are treated as a fidelity in HPO, and when frozen layers are combined with other fidelity sources in MF-HPO. This contribution opens new applications for MF-HPO with hardware resources as a fidelity and creates opportunities for improved algorithms that navigate joint fidelity spaces.
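Whether a low fidelity preserves the ranking of hyperparameter configurations is commonly checked with a rank correlation such as Spearman's coefficient. The following is a minimal pure-Python sketch of that check. The loss values are made-up illustrative numbers, not results from this work, and the choice of Spearman's coefficient is an assumption, since the text above does not name the specific correlation measure used.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical validation losses for four configurations.
loss_few_layers_trained = [0.92, 0.85, 1.10, 0.78]  # low fidelity
loss_full_training = [0.61, 0.55, 0.80, 0.57]       # full fidelity
print(spearman(loss_few_layers_trained, loss_full_training))
```

A coefficient close to 1 indicates that the configurations that look best with few trainable layers also tend to be best under full training, which is the property a fidelity needs for early-stopping schedules to be reliable.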
