Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models
Yichen Xu, Yuyang Liang, Shan Dai, Tianyang Hu, Tsz Nam Chan, Chenhao Ma
TL;DR
This work reframes parameter-efficient fine-tuning (PEFT) as layerwise residual correction on a frozen large language model, introducing a local quadratic framework governed by the projected residual norm $\mathrm{Res}_\ell$, activation energy $E_{\mathrm{act}}(\Sigma_\ell)$, and inter-layer coupling $\rho$. It introduces the Layer Card diagnostic to summarize per-layer residual strength, compute cost, and downstream performance, enabling cost-aware, layer-wise placement strategies; experiments on GPT-2 and Qwen3-8B demonstrate that selectively adapting a subset of layers can approach full-layer LoRA performance with substantial reductions in training time and memory, while Layer Card-guided decisions provide robust, transferable guidance across tasks. Theoretical results connect projected residuals to covariance-normalized gradients, show additivity bounds under weak coupling, and extend to nonlinear adapters, thereby offering a principled, scalable approach to layer-aware PEFT. Collectively, the work provides a practical toolkit for balancing performance gains against training and serving costs, enabling flexible deployment under diverse latency and budget constraints.
Abstract
As large language models (LLMs) continue to grow, the cost of full-parameter fine-tuning has made parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Constraints from inference latency in scalable serving and fine-tuning cost in edge or rapid-deployment settings make the choice of which layers to fine-tune unavoidable. Yet current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model. Under a local quadratic approximation, layerwise adaptation is governed by three quantities: (i) the projected residual norm (resnorm), which measures how much correctable bias a layer can capture; (ii) the activation energy, which determines feature conditioning; and (iii) layer coupling, which quantifies how strongly residuals interact across layers. We show that, for squared loss and linear adapters, the resnorm equals a normalized gradient norm, activation energy controls ill-conditioning and noise amplification, and weak coupling yields approximately additive layerwise contributions. Building on these insights, we introduce the Layer Card, a reusable diagnostic that summarizes residual signal strength, compute cost, and performance for each layer of a given model. With an identical model and LoRA configuration, Layer Card-guided placement refines the choice of adapted layers to flexibly prioritize different objectives, such as maximizing performance or reducing fine-tuning cost. Moreover, on Qwen3-8B, we show that selectively adapting a subset of layers can achieve performance close to full-layer LoRA while substantially reducing fine-tuning cost and the number of adapter-augmented layers during inference, offering a more cost-performance-aware alternative to full-layer insertion.
