RMT-KD: Random Matrix Theoretic Causal Knowledge Distillation

Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi

TL;DR

RMT-KD reduces the cost of deploying large models by using Random Matrix Theory to identify the informative directions in hidden activations. It compresses a network iteratively: at each stage, activations are projected onto a causal subspace spanned by the eigenvectors whose eigenvalues are outliers beyond the Marchenko-Pastur (MP) bulk, and self-distillation is applied to preserve accuracy. The spectrum is that of the empirical covariance $\Sigma = \tfrac{1}{n} X X^\top$, the bulk edges are $\lambda_\pm = \sigma^2(1 \pm \sqrt{d/n})^2$, and directions with eigenvalues above $\lambda_+$ are retained. This rule, based on BBP-like (Baik-Ben Arous-Péché) signal separation, replaces heuristic rank choices. Training minimizes $L = \alpha\, \mathrm{CE}_{\text{task}} + (1-\alpha)\, \mathrm{KL}(p_{\text{old}} \| p_{\text{new}})$ and the process repeats until compression targets are met. Empirically, the method yields up to $\approx 80\%$ parameter reduction with about $2\%$ accuracy loss, inference speedups of around $2.8\times$, and energy savings on GLUE and CIFAR-10. Gains are larger for overparameterized transformers, and the compressed networks stay dense, which keeps them hardware-friendly.
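The spectral selection rule above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `mp_upper_edge` and `causal_projection`, the synthetic data, and the choice to estimate $\sigma^2$ as a quantile of the eigenvalue spectrum (suggested by the figure captions, which use the median) are all assumptions made for illustration. The activation matrix $X$ is taken to have shape $d \times n$, matching $\Sigma = \tfrac{1}{n} X X^\top$.

```python
import numpy as np


def mp_upper_edge(sigma2: float, d: int, n: int) -> float:
    """Upper Marchenko-Pastur edge: sigma^2 * (1 + sqrt(d/n))^2."""
    return sigma2 * (1.0 + np.sqrt(d / n)) ** 2


def causal_projection(X: np.ndarray, quantile: float = 0.5):
    """Select eigenvectors of the empirical covariance above the MP edge.

    X has shape (d, n): d hidden units, n calibration samples.
    sigma^2 is estimated as a quantile of the eigenvalues (assumption).
    Returns (P, lam_plus), where the columns of P span the retained subspace.
    """
    d, n = X.shape
    cov = (X @ X.T) / n                        # Sigma = (1/n) X X^T
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    sigma2 = float(np.quantile(eigvals, quantile))
    lam_plus = mp_upper_edge(sigma2, d, n)
    keep = eigvals > lam_plus                  # outliers beyond the bulk
    return eigvecs[:, keep], lam_plus


# Synthetic check: unit-variance noise plus k strong planted directions.
rng = np.random.default_rng(0)
d, n, k = 64, 2048, 3
noise = rng.standard_normal((d, n))
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal, d x k
signal = basis @ (10.0 * rng.standard_normal((k, n)))
X = noise + signal

P, lam_plus = causal_projection(X)
Z = P.T @ X                                    # projected activations
print(f"lambda_plus = {lam_plus:.3f}, kept {P.shape[1]} of {d} directions")
```

On this synthetic input the planted directions have eigenvalues far above the bulk, so they should be among those retained. Whether a borderline noise eigenvalue also crosses $\lambda_+$ depends on finite-sample fluctuation and on how $\sigma^2$ is estimated.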

Abstract

Large deep learning models such as BERT and ResNet achieve state-of-the-art performance but are costly to deploy at the edge due to their size and compute demands. We present RMT-KD, a compression method that leverages Random Matrix Theory (RMT) for knowledge distillation to iteratively reduce network size. Instead of pruning or heuristic rank selection, RMT-KD preserves only informative directions identified via the spectral properties of hidden representations. RMT-based causal reduction is applied layer by layer with self-distillation to maintain stability and accuracy. On GLUE and CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy loss, delivering 2.8x faster inference and nearly halved power consumption. These results establish RMT-KD as a mathematically grounded approach to network distillation.
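As a rough illustration of the self-distillation objective, the sketch below evaluates $L = \alpha\, \mathrm{CE}_{\text{task}} + (1-\alpha)\, \mathrm{KL}(p_{\text{old}} \| p_{\text{new}})$ from the TL;DR for a single example, using only the Python standard library. The helper names, the example logits, and the default $\alpha = 0.5$ are hypothetical; the source does not specify a value for $\alpha$ or whether a temperature is used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Task loss: negative log-probability of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def rmt_kd_loss(new_logits, old_logits, label, alpha=0.5):
    """alpha * CE_task + (1 - alpha) * KL(p_old || p_new).

    old_logits come from the model before the current compression stage,
    new_logits from the compressed model. alpha=0.5 is an assumed default.
    """
    p_new = softmax(new_logits)
    p_old = softmax(old_logits)
    return (alpha * cross_entropy(p_new, label)
            + (1.0 - alpha) * kl_divergence(p_old, p_new))


# Hypothetical logits for a 3-class example whose true class is 0.
old_logits = [2.0, 0.5, -1.0]
new_logits = [1.5, 0.7, -0.8]
print(rmt_kd_loss(new_logits, old_logits, label=0))
```

With $\alpha = 1$ the loss reduces to the task cross-entropy alone, and with $\alpha = 0$ it only penalizes drift from the previous model's predictions.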

Paper Structure

This paper contains 4 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Architecture of RMT-KD for iterative distillation. At each stage, hidden layer activations are analyzed with RMT principles to identify causal directions, followed by projection and self-distillation. This process is repeated across layers until the benefits saturate.
  • Figure 2: The empirical eigenvalue distribution of the activation matrices computed on the calibration dataset for BERT-base on the SST dataset.
  • Figure 3: (a, top) Accuracy vs. parameter reduction and (b, bottom) power consumption vs. inference speedup on GLUE datasets ($\sigma^2 =$ median eigenvalue, initial quantile = 50%).
  • Figure 4: Comparison of memory on disk and energy efficiency for all models and GLUE datasets ($\sigma^2 =$ median eigenvalue, initial quantile = 50%).
  • Figure 5: Accuracy–reduction tradeoff for BERT-base (GLUE) and ResNet-50 (CIFAR-10) as a function of the eigenvalue quantile used to initialize $\sigma^2$. The x-axis shows the quantile; the y-axis shows accuracy (decreasing) and parameter reduction (increasing). The best balance occurs near the 40% quantile.