Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization

Theophilus Amaefuna; Hitesh Vaidya; Anshuman Chhabra; Ankur Mali

TL;DR

Two convex MDL programs are formulated: a capacity allocation program that distributes expert slots or LoRA rank preferentially to high-curvature layers under diminishing returns, and a pruning program that concentrates sparsity on low-gain layers while protecting high-gain layers from degradation.

Abstract

Layer-wise capacity in large language models is highly non-uniform: some layers contribute disproportionately to loss reduction while others are near-redundant. Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions under hardware constraints. We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle. Our central quantity is the curvature-adjusted layer gain $\zeta_k^2 = g_k^\top \widetilde{H}_{kk}^{-1} g_k$, which we show equals twice the maximal second-order reduction in empirical risk achievable by updating layer $k$ alone, and which strictly dominates gradient-norm-based scores by incorporating local curvature. Normalizing these gains into layer quality scores $q_k$, we formulate two convex MDL programs: a capacity allocation program that distributes expert slots or LoRA rank preferentially to high-curvature layers under diminishing returns, and a pruning program that concentrates sparsity on low-gain layers while protecting high-gain layers from degradation. Both programs admit unique closed-form solutions parameterized by a single dual variable, computable in $O(K \log(1/\varepsilon))$ via bisection. We prove an $O(\delta^2)$ transfer regret bound showing that source-domain allocations remain near-optimal on target tasks when curvature scores drift by $\delta$, with explicit constants tied to the condition number of the target program. Together, these results elevate layer-wise capacity optimization from an empirical heuristic to a theoretically grounded, computationally efficient framework with provable optimality and generalization guarantees.
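The layer gain defined in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' implementation: the function names, the damping term used to regularize the Hessian block, and the sum-to-one normalization of the quality scores are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def layer_gain(g, H, damping=1e-3):
    """Curvature-adjusted gain g^T (H + damping*I)^{-1} g for one layer.

    The damping term is an assumed way of forming a regularized,
    positive-definite Hessian block; the paper's choice may differ.
    """
    g = np.asarray(g, dtype=float)
    H = np.asarray(H, dtype=float)
    H_reg = H + damping * np.eye(g.size)
    # Solve a linear system instead of forming the inverse explicitly.
    return float(g @ np.linalg.solve(H_reg, g))

def quality_scores(gains):
    """Normalize gains so they sum to one (assumed normalization)."""
    gains = np.asarray(gains, dtype=float)
    return gains / gains.sum()

# Toy example: two layers with the same gradient but different curvature.
g = np.array([1.0, 1.0])
flat = np.diag([0.5, 0.5])    # low curvature: a step reduces the loss a lot
sharp = np.diag([2.0, 2.0])   # high curvature: same gradient, smaller gain
gains = [layer_gain(g, flat, damping=0.0),
         layer_gain(g, sharp, damping=0.0)]
q = quality_scores(gains)     # gains are 4.0 and 1.0, so q = [0.8, 0.2]
```

Both toy layers have identical gradient norms, so a gradient-norm score would rank them equally, while the curvature-adjusted gain separates them.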

Paper Structure (59 sections, 4 theorems, 45 equations, 4 figures, 5 tables, 2 algorithms)

Introduction
The missing ingredient: curvature.
This work.
Connections to information theory.
Contributions.
Background
Minimum Description Length.
MDL for capacity allocation.
MDL for pruning.
Layer quality via second-order information.
Layer Influence scores and their limitations.
Objective and Notation
Second-Order Expansion and Layer-Restricted Decrease
Second-order Taylor expansion.
Layer-restricted quadratic model.
...and 44 more sections

Key Result

Lemma 1

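If $\widetilde{H}_{kk} \succ 0$, the unique minimizer of $\widetilde{Q}_k(d)$ over $d \in \mathbb{R}^{p_k}$ is $d_k^\star = -\widetilde{H}_{kk}^{-1} g_k$, and the corresponding decrease equals $\widetilde{Q}_k(0) - \widetilde{Q}_k(d_k^\star) = \tfrac{1}{2}\, g_k^\top \widetilde{H}_{kk}^{-1} g_k = \tfrac{1}{2}\,\zeta_k^2$.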
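The lemma's minimizer and decrease follow from a short first-order argument. The derivation below assumes the layer-restricted quadratic model has the standard form shown in its first line (its definition is not reproduced in this excerpt), and the symbol $d_k^\star$ is introduced here only for illustration.

```latex
\begin{aligned}
\widetilde{Q}_k(d) &= g_k^\top d + \tfrac{1}{2}\, d^\top \widetilde{H}_{kk}\, d, \\
\nabla_d \widetilde{Q}_k(d) = g_k + \widetilde{H}_{kk}\, d = 0
  \;&\Longrightarrow\; d_k^\star = -\widetilde{H}_{kk}^{-1} g_k, \\
\widetilde{Q}_k(d_k^\star)
  &= -g_k^\top \widetilde{H}_{kk}^{-1} g_k
     + \tfrac{1}{2}\, g_k^\top \widetilde{H}_{kk}^{-1} g_k
   = -\tfrac{1}{2}\, g_k^\top \widetilde{H}_{kk}^{-1} g_k, \\
\widetilde{Q}_k(0) - \widetilde{Q}_k(d_k^\star)
  &= \tfrac{1}{2}\, g_k^\top \widetilde{H}_{kk}^{-1} g_k
   = \tfrac{1}{2}\,\zeta_k^2 .
\end{aligned}
```

Uniqueness follows because $\widetilde{H}_{kk} \succ 0$ makes the model strictly convex. This matches the abstract's statement that $\zeta_k^2$ equals twice the maximal second-order reduction achievable by updating layer $k$ alone.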

Figures (4)

Figure 1: (a) Layer-wise curvature scores $\zeta_k^2$ vary substantially across the transformer stack. High-gain layers (dark bars) hold disproportionate reducible risk and should receive additional capacity; low-gain layers (light bars) are candidates for aggressive pruning. (b) Our framework computes $\zeta_k^2$ from per-layer gradients $g_k$ and regularized Hessian blocks $\widetilde{H}_{kk}$, normalizes them to quality scores $q_k$, and solves two convex programs: a capacity allocation program (Theorem 2) that enriches high-gain layers, and a pruning program (Theorem 3) that concentrates sparsity on low-gain layers. Both programs admit closed-form solutions via $O(K \log(1/\varepsilon))$ bisection (Algorithms 1 and 2). Theorem 4 bounds the cost of transferring source-domain allocations to a target domain.
Figure 2: Expert allocation accuracy (%) on Mistral-7B-v0.1 (5 epochs)
Figure 3: Expert allocation accuracy (%) on Gemma-7B, +ve variant only (5 epochs)
Figure : MDL-Optimal Expert Allocation
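The "closed form in one dual variable, found by bisection" pattern that the captions refer to can be sketched on a stand-in problem. The paper's actual MDL objective is not reproduced in this excerpt, so the code below assumes a hypothetical diminishing-returns objective: maximize $\sum_k q_k \log(1 + c_k)$ subject to $\sum_k c_k = B$ and $c_k \ge 0$. Its KKT conditions give $c_k(\lambda) = \max(0,\, q_k/\lambda - 1)$, and the function name `allocate` and all parameters are illustrative.

```python
import numpy as np

def allocate(q, budget, eps=1e-10):
    """Water-filling allocation for an ASSUMED objective sum_k q_k*log(1+c_k).

    Requires at least one positive score and budget > 0. Returns a
    continuous allocation; rounding to integer slots is not handled.
    """
    q = np.asarray(q, dtype=float)

    def alloc(lam):
        # Closed form for a fixed dual variable (price) lam.
        return np.maximum(0.0, q / lam - 1.0)

    # Bracket the dual variable: spending is decreasing in lam.
    lo = q.max() / (1.0 + budget)  # top layer alone already gets `budget`
    hi = q.max()                   # every layer gets zero
    while hi - lo > eps:           # O(log(1/eps)) steps, O(K) work each
        mid = 0.5 * (lo + hi)
        if alloc(mid).sum() > budget:
            lo = mid               # overspending: raise the price
        else:
            hi = mid               # underspending: lower the price
    return alloc(0.5 * (lo + hi))

q = np.array([0.5, 0.3, 0.2])
c = allocate(q, budget=9.0)        # approximately [5.0, 2.6, 1.4]
```

With a small budget such as `budget=0.5`, the same routine assigns everything to the highest-scoring layer and leaves the others at zero, which is the kind of boundary solution a thresholded closed form produces.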

Theorems & Definitions (10)

Lemma 1: Layer-restricted optimum
Theorem 2: Convexity and closed form for allocation
Theorem 3: Strong convexity and closed form for pruning
Remark : Budget flexibility
proof
proof
proof
Theorem 4: Transfer regret under score drift
proof
Remark : Boundary solutions
