Demystifying Low-Rank Knowledge Distillation in Large Language Models: Convergence, Generalization, and Information-Theoretic Guarantees

Alberlucia Rafael Soarez; Daniel Kim; Mariana Costa; Alejandro Torre

Demystifying Low-Rank Knowledge Distillation in Large Language Models: Convergence, Generalization, and Information-Theoretic Guarantees

Alberlucia Rafael Soarez, Daniel Kim, Mariana Costa, Alejandro Torre

Abstract

Knowledge distillation has emerged as a powerful technique for compressing large language models (LLMs) into efficient, deployable architectures while preserving their advanced capabilities. Recent advances in low-rank knowledge distillation, particularly methods like Low-Rank Clone (LRC), have demonstrated remarkable empirical success, achieving comparable performance to full-parameter distillation with significantly reduced training data and computational overhead. However, the theoretical foundations underlying these methods remain poorly understood. In this paper, we establish a rigorous theoretical framework for low-rank knowledge distillation in language models. We prove that under mild assumptions, low-rank projection preserves the optimization dynamics, yielding explicit convergence rates of $O(1/\sqrt{T})$. We derive generalization bounds that characterize the fundamental trade-off between model compression and generalization capability, showing that the generalization error scales with the rank parameter as $O(r(m+n)/\sqrt{n})$. Furthermore, we provide an information-theoretic analysis of the activation cloning mechanism, revealing its role in maximizing the mutual information between the teacher's and student's intermediate representations. Our theoretical results offer principled guidelines for rank selection, mathematically suggesting an optimal rank $r^* = O(\sqrt{n})$ where $n$ is the sample size. Experimental validation on standard language modeling benchmarks confirms our theoretical predictions, demonstrating that the empirical convergence, rank scaling, and generalization behaviors align closely with our bounds.

Demystifying Low-Rank Knowledge Distillation in Large Language Models: Convergence, Generalization, and Information-Theoretic Guarantees

Abstract

. We derive generalization bounds that characterize the fundamental trade-off between model compression and generalization capability, showing that the generalization error scales with the rank parameter as

. Furthermore, we provide an information-theoretic analysis of the activation cloning mechanism, revealing its role in maximizing the mutual information between the teacher's and student's intermediate representations. Our theoretical results offer principled guidelines for rank selection, mathematically suggesting an optimal rank

where

is the sample size. Experimental validation on standard language modeling benchmarks confirms our theoretical predictions, demonstrating that the empirical convergence, rank scaling, and generalization behaviors align closely with our bounds.

Paper Structure (35 sections, 5 theorems, 29 equations, 4 figures, 2 tables)

This paper contains 35 sections, 5 theorems, 29 equations, 4 figures, 2 tables.

Introduction
Related Work
Knowledge Distillation and Large Models.
Low-Rank Compression and Efficient Architectures.
Vision-Language and Generative Agents.
Theoretical Analysis and Representation.
Information Theory in Deep Learning.
Preliminaries
Notation
Problem Formulation
Assumptions
Theoretical Analysis
Convergence Analysis
Generalization Bounds
Information-Theoretic Analysis of Activation Cloning
...and 20 more sections

Key Result

Lemma 1

Under Assumptions assump:smooth and assump:lowrank, the gradient after low-rank projection satisfies: where $C$ is a constant depending on the network architecture and the norms of projection matrices.

Figures (4)

Figure 1: Empirical verification of our theoretical bounds. (a) Convergence curves for different methods. LRC converges at rate $O(1/\sqrt{T})$ as predicted by Theorem \ref{['thm:convergence']}. The gray dashed line shows the theoretical bound. (b) The generalization gap increases with rank $r$ as predicted by Theorem \ref{['thm:generalization']}, while test accuracy shows an inverse relationship.
Figure 2: Validation of rank scaling and activation cloning. (a) Empirical optimal ranks align closely with the theoretical prediction $r^* = O(\sqrt{n})$. The shaded region shows a 15% confidence interval. (b) Mutual information between teacher and student hidden states at each layer. With activation cloning (blue), MI is significantly higher than without cloning (orange), validating Theorem \ref{['thm:activation_mi']}.
Figure 3: Parameter sensitivity analysis. The model shows robustness to learning rate variations and exhibits a clear performance sweet spot for the clone weight $\lambda$.
Figure 4: Rank sensitivity. Performance degrades when the rank deviates significantly from the optimal value $r^*$, forming a U-shaped validation curve.

Theorems & Definitions (14)

Lemma 1: Gradient Preservation
proof : Proof Sketch
Theorem 1: Convergence Rate
proof
Corollary 1: Optimal Rank Selection
proof
Definition 1: Rademacher Complexity for Low-Rank Models
Theorem 2: Generalization Bound
proof
Definition 2: Mutual Information for Knowledge Transfer
...and 4 more

Demystifying Low-Rank Knowledge Distillation in Large Language Models: Convergence, Generalization, and Information-Theoretic Guarantees

Abstract

Demystifying Low-Rank Knowledge Distillation in Large Language Models: Convergence, Generalization, and Information-Theoretic Guarantees

Authors

Abstract

Table of Contents

Key Result

Figures (4)

Theorems & Definitions (14)