Table of Contents
Fetching ...

Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models

Yichen Xu, Yuyang Liang, Shan Dai, Tianyang Hu, Tsz Nam Chan, Chenhao Ma

TL;DR

This work reframes parameter-efficient fine-tuning (PEFT) as layerwise residual correction on a frozen large language model, introducing a local quadratic framework governed by the projected residual norm $\mathrm{Res}_\ell$, activation energy $E_{\mathrm{act}}(\Sigma_\ell)$, and inter-layer coupling $\rho$. It introduces the Layer Card diagnostic to summarize per-layer residual strength, compute cost, and downstream performance, enabling cost-aware, layer-wise placement strategies; experiments on GPT-2 and Qwen3-8B demonstrate that selectively adapting a subset of layers can approach full-layer LoRA performance with substantial reductions in training time and memory, while Layer Card-guided decisions provide robust, transferable guidance across tasks. Theoretical results connect projected residuals to covariance-normalized gradients, show additivity bounds under weak coupling, and extend to nonlinear adapters, thereby offering a principled, scalable approach to layer-aware PEFT. Collectively, the work provides a practical toolkit for balancing performance gains against training and serving costs, enabling flexible deployment under diverse latency and budget constraints.

Abstract

As large language models (LLMs) continue to grow, the cost of full-parameter fine-tuning has made parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Constraints from inference latency in scalable serving and fine-tuning cost in edge or rapid-deployment settings make the choice of which layers to fine-tune unavoidable. Yet current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model. Under a local quadratic approximation, layerwise adaptation is governed by three quantities: (i) the projected residual norm (resnorm), which measures how much correctable bias a layer can capture; (ii) the activation energy, which determines feature conditioning; and (iii) layer coupling, which quantifies how strongly residuals interact across layers. We show that, for squared loss and linear adapters, the resnorm equals a normalized gradient norm, activation energy controls ill-conditioning and noise amplification, and weak coupling yields approximately additive layerwise contributions. Building on these insights, we introduce the Layer Card, a reusable diagnostic that summarizes residual signal strength, compute cost, and performance for each layer of a given model. With an identical model and LoRA configuration, Layer Card-guided placement refines the choice of adapted layers to flexibly prioritize different objectives, such as maximizing performance or reducing fine-tuning cost. Moreover, on Qwen3-8B, we show that selectively adapting a subset of layers can achieve performance close to full-layer LoRA while substantially reducing fine-tuning cost and the number of adapter-augmented layers during inference, offering a more cost-performance-aware alternative to full-layer insertion.

Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models

TL;DR

This work reframes parameter-efficient fine-tuning (PEFT) as layerwise residual correction on a frozen large language model, introducing a local quadratic framework governed by the projected residual norm , activation energy , and inter-layer coupling . It introduces the Layer Card diagnostic to summarize per-layer residual strength, compute cost, and downstream performance, enabling cost-aware, layer-wise placement strategies; experiments on GPT-2 and Qwen3-8B demonstrate that selectively adapting a subset of layers can approach full-layer LoRA performance with substantial reductions in training time and memory, while Layer Card-guided decisions provide robust, transferable guidance across tasks. Theoretical results connect projected residuals to covariance-normalized gradients, show additivity bounds under weak coupling, and extend to nonlinear adapters, thereby offering a principled, scalable approach to layer-aware PEFT. Collectively, the work provides a practical toolkit for balancing performance gains against training and serving costs, enabling flexible deployment under diverse latency and budget constraints.

Abstract

As large language models (LLMs) continue to grow, the cost of full-parameter fine-tuning has made parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Constraints from inference latency in scalable serving and fine-tuning cost in edge or rapid-deployment settings make the choice of which layers to fine-tune unavoidable. Yet current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model. Under a local quadratic approximation, layerwise adaptation is governed by three quantities: (i) the projected residual norm (resnorm), which measures how much correctable bias a layer can capture; (ii) the activation energy, which determines feature conditioning; and (iii) layer coupling, which quantifies how strongly residuals interact across layers. We show that, for squared loss and linear adapters, the resnorm equals a normalized gradient norm, activation energy controls ill-conditioning and noise amplification, and weak coupling yields approximately additive layerwise contributions. Building on these insights, we introduce the Layer Card, a reusable diagnostic that summarizes residual signal strength, compute cost, and performance for each layer of a given model. With an identical model and LoRA configuration, Layer Card-guided placement refines the choice of adapted layers to flexibly prioritize different objectives, such as maximizing performance or reducing fine-tuning cost. Moreover, on Qwen3-8B, we show that selectively adapting a subset of layers can achieve performance close to full-layer LoRA while substantially reducing fine-tuning cost and the number of adapter-augmented layers during inference, offering a more cost-performance-aware alternative to full-layer insertion.
Paper Structure (27 sections, 8 theorems, 85 equations, 7 figures, 9 tables, 1 algorithm)

This paper contains 27 sections, 8 theorems, 85 equations, 7 figures, 9 tables, 1 algorithm.

Key Result

Theorem 3.2

Under ass:local_quadratic_pd and assuming $\rho<1$. Let $\Delta\theta_{\mathrm{quad}} :=\theta^{\mathrm{glob}}_{\mathrm{quad}}-\theta^{\mathrm{loc}}_{\mathrm{quad}}$. Then Moreover, if $F(x;\theta)-F(x;0) = J(x)^\top\theta + \tfrac{1}{2}\,\theta^\top K(x)\theta$ with block decomposition $K(x)=D_K(x)+E_K(x)$ and residual coupling $\rho_K(x):=\|D_K(x)^{-1/2}E_K(x)D_K(x)^{-1/2}\|_2<1$, then for all

Figures (7)

  • Figure 1: Overview of the projected residual framework for parameter-efficient fine-tuning
  • Figure 2: Layerwise profiles of projected residual norm, activation energy, and gradient norm across DART, E2E, and WebNLG. Top: GPT-2 Medium. Bottom: GPT-2 Large.
  • Figure 3: Conditioning of layerwise representations under different adapter placement regimes.
  • Figure 4: Relative training time and memory usage for GPT-2 Medium across datasets and adapter placements; all configurations use identical adapter sizes.
  • Figure 5: Task-wise Spearman correlation of projected residual norm rankings. Upper triangle: LLaMA; lower triangle: Qwen.
  • ...and 2 more figures

Theorems & Definitions (16)

  • Theorem 3.2: Approximate additivity of quadratic PEFT residuals
  • Theorem 3.3: Projected residual norm equals normalized gradient norm
  • Proposition 4.1: Spectral geometry governs noise amplification and budget hardness
  • Proposition 5.1: Selective compensation under layer interactions
  • Theorem 5.2: Spreading tuned layers improves quadratic gain by reducing coupling
  • proof : Proof of \ref{['thm:decomposition_main']}
  • proof : Proof of \ref{['thm:layer_grad_residual_equivalence']}
  • proof : Proof of \ref{['prop:spectral-hardness']}
  • Lemma A.1: Interaction norm versus off-diagonal magnitude
  • proof : Proof of \ref{['lem:E_vs_whitened_app']}
  • ...and 6 more