Frozen Layers: Memory-efficient Many-fidelity Hyperparameter Optimization

Timur Carstensen, Neeratyoy Mallik, Frank Hutter, Martin Rapp

TL;DR

This work introduces layer freezing as a memory-efficient fidelity for multi-fidelity hyperparameter optimization (MF-HPO) in deep learning. By freezing the first $z$ of $n$ layers and training only the remaining $n{-}z$ layers, the method achieves substantial compute and memory savings (often $\geq 2\times$) while preserving strong rank correlations with full training across Transformer and ResNet architectures, enabling effective MF-HPO on budget-constrained hardware. The authors formalize two fidelity properties (cost monotonicity and mutual information monotonicity) and demonstrate that the layer fidelity can be combined with traditional fidelities (e.g., training tokens) to improve search efficiency via joint fidelity landscapes and Successive Halving (SH)-style schedules. The approach extends MF-HPO to memory-limited settings and heterogeneous hardware, with practical implications for cost, accessibility, and energy efficiency in large-scale DL tuning.
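To make the memory argument concrete, the following is a minimal back-of-envelope sketch, not the authors' cost model. It assumes that all weights stay in memory for the forward pass, that only trainable layers need gradients, optimizer states, and stored activations (the three savings listed in the Figure 1 caption), and that the optimizer keeps two states per parameter, as Adam does. The function name `training_memory_bytes` and all of its parameters are hypothetical.

```python
def training_memory_bytes(layer_params, frozen, bytes_per_value=4,
                          optimizer_states_per_param=2,
                          activation_values_per_layer=0):
    """Rough training-memory estimate when the first `frozen` layers are frozen.

    layer_params: parameter count of each layer, ordered from input to output.
    All other arguments are illustrative assumptions, not values from the paper.
    """
    n = len(layer_params)
    if not 0 <= frozen <= n:
        raise ValueError("frozen must be between 0 and the number of layers")

    weights = sum(layer_params)             # every layer is needed in the forward pass
    trainable = sum(layer_params[frozen:])  # only unfrozen layers are updated
    gradients = trainable                   # one gradient value per trainable parameter
    optimizer = optimizer_states_per_param * trainable
    # Assumption: activations are stored only for layers on the backward path.
    activations = activation_values_per_layer * (n - frozen)

    return bytes_per_value * (weights + gradients + optimizer + activations)


if __name__ == "__main__":
    layers = [1000] * 4  # toy network with four equally sized layers
    for k in range(len(layers) + 1):
        print(k, training_memory_bytes(layers, frozen=k))
```

In this toy setting the estimate decreases monotonically as more layers are frozen, which is the cost-monotonicity property in miniature. Real savings depend on how parameters and activations are distributed across layers.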

Abstract

As model sizes grow, finding efficient and cost-effective hyperparameter optimization (HPO) methods becomes increasingly crucial for deep learning pipelines. While multi-fidelity HPO (MF-HPO) trades the computational resources required for DL training against lower-fidelity estimates, existing fidelity sources often fail under tight compute and memory constraints. We propose a novel fidelity source: the number of layers that are trained or frozen during training. For deep networks, this approach offers significant compute and memory savings while preserving rank correlations between hyperparameters at low fidelities compared to full model training. We demonstrate this in our empirical evaluation across ResNets and Transformers, and additionally analyze the utility of frozen layers as a fidelity in two settings: using GPU resources as a fidelity in HPO, and combining frozen layers with other fidelity sources in MF-HPO. This contribution opens new applications for MF-HPO with hardware resources as a fidelity and creates opportunities for improved algorithms navigating joint fidelity spaces.
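The notion of "preserving rank correlations" can be illustrated with a small self-contained sketch. It computes Spearman's rank correlation between low-fidelity and full-fidelity scores of the same hyperparameter configurations. Spearman is one common choice and is assumed here for illustration; the text above does not say which rank statistic the authors use. The helper names and the example scores are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Made-up validation losses for four configurations (illustration only).
    low_fidelity = [0.9, 0.7, 0.8, 0.5]    # e.g. few trainable layers
    full_fidelity = [0.6, 0.4, 0.55, 0.3]  # e.g. all layers trainable
    print(spearman(low_fidelity, full_fidelity))
```

A value near 1 means the cheap evaluation orders configurations almost the same way as full training, which is what makes a fidelity useful for pruning poor configurations early.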

Paper Structure

This paper contains 21 sections, 6 figures.

Figures (6)

  • Figure 1: Training a partially frozen neural network requires fewer resources: 1) lower compute due to a shorter backward path, 2) lower memory because the activations of the first $z$ layers need not be kept in memory for the backward pass, and 3) lower memory because no optimizer states are needed for the first $z$ layers. The resource requirements are adjustable via $z$.
  • Figure 2: (Left) Hardware resource requirements for Pythia 1.4B (Biderman et al., 2023) with batch size 1 at each fidelity. Time refers to the time taken per step (forward, backward, and optimizer step). At the lowest fidelity, our method reduces memory requirements by a factor of $\geq3\times$ and speeds up runtime by a factor of $\geq4\times$. (Right) Comparison of trainable parameters under layer freezing as fidelity. Different architectures distribute parameters unevenly across layers, resulting in varying computational costs as fidelity increases.
  • Figure 3: The $x$-axis shows the discrete number of layers being trained, counted from the output layer backwards. The highest number of trainable layers corresponds to full model training and therefore to the best performance and the highest cost. Runtime is the time a single step (forward + backward + optimizer step) takes at each fidelity, relative to the fully trainable model. (Left) 14M parameter GPT-2 model trained for 20 tokens per parameter at each fidelity. (Right) ResNet-18 trained on CIFAR-100 for 20 epochs at each fidelity.
  • Figure 4: Rank correlation with full-fidelity validation performance for a 14M parameter GPT-2 and ResNet-18. Each hyperparameter configuration received the same training budget. Refer to Table \ref{tab:eligibility-hp-search-spaces} for the search space of configurations. Each evaluation is treated as a black-box evaluation given trainable layers as fidelity.
  • Figure 5: Hyperparameter rank correlation landscape across the joint fidelity space of trainable layers (y-axis) and training tokens (x-axis) for a 14M GPT-2 model (hyperparameter details in Table \ref{tab:eligibility-hp-search-spaces}). Two black traces represent potential Successive Halving (SH) runs with $\eta=2$: the dashed line shows traditional SH using only data as fidelity, while the solid line demonstrates our proposed approach using both layers and data as fidelities. Markers indicate SH query points, with joint fidelity queries at (1 layer, 12% tokens), (2 layers, 25% tokens), (5 layers, 50% tokens), and single fidelity queries at (all layers, {12, 25, 50}% tokens). The three bottom-row plots visualize rank correlation thresholds of $\{0.6,~0.85,~0.95\}$ respectively. Notably, except at the lowest fidelity where additional layers provide stronger correlations, a joint fidelity approach achieves better correlation with reduced computational cost in both wall-clock time and FLOPs (see Table \ref{table:mf-hpo-gains} for quantitative comparisons).
  • ...and 1 more figure
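The joint-fidelity Successive Halving trace described in the Figure 5 caption can be sketched as follows. This is a generic SH loop with $\eta=2$ run over the three joint query points named in the caption; it is an illustrative sketch, not the authors' implementation. The function `successive_halving`, the toy objective `toy_loss`, and the candidate configurations (log10 learning rates) are all hypothetical.

```python
# Joint fidelity rungs from the Figure 5 caption: (trainable layers, token fraction).
JOINT_RUNGS = [(1, 0.12), (2, 0.25), (5, 0.50)]


def successive_halving(configs, rungs, evaluate, eta=2):
    """Evaluate survivors at each rung and keep the best 1/eta (lower is better)."""
    survivors = list(configs)
    history = []
    for fidelity in rungs:
        ranked = sorted(survivors, key=lambda c: evaluate(c, fidelity))
        history.append((fidelity, ranked))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
    return survivors, history


def toy_loss(log10_lr, fidelity):
    """Hypothetical objective: best at log10_lr = -3, worse at lower fidelity."""
    layers, token_fraction = fidelity
    fidelity_penalty = 1.0 / (layers * token_fraction)  # shrinks as fidelity grows
    return abs(log10_lr + 3.0) + fidelity_penalty


if __name__ == "__main__":
    candidates = [-5.0, -4.5, -4.0, -3.5, -3.0, -2.5, -2.0, -1.5]
    best, history = successive_halving(candidates, JOINT_RUNGS, toy_loss)
    for fidelity, ranked in history:
        print(fidelity, len(ranked), "configs evaluated")
    print("survivor:", best)
```

The single-fidelity baseline from the caption corresponds to replacing the rungs with (all layers, 12%), (all layers, 25%), and (all layers, 50%). The joint schedule evaluates the early, most crowded rungs with few trainable layers, which is where the per-step cost is lowest.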