
Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT

Younes Javanmard, Tanmoy Pandit, Masoud Mardani

Abstract

Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with the hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension $\chi$. We replace every nn.Linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or randomly, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions $\chi \in \{4, 8, 16, 32\}$ on Tiny Shakespeare. MPO compression achieves up to $13\times$ compression per transformer block at $\chi = 4$. At $\chi = 16$, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs. 52.8%). Reconstruction error follows the expected trend and is lower for three-site than for two-site factorisations at the same bond dimension. The $\chi = 8$ model gives the best accuracy per parameter, exceeding the dense baseline by $2.7\times$ on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
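To make the parameterisation concrete, the following is a minimal PyTorch sketch of an MPO-parameterised linear layer for the two-site case ($L = 2$). The class and argument names are illustrative, not taken from the PicoGPT codebase; the reconstruct-then-matmul forward pass is one simple way to realise the paper's setup of training the cores directly with standard autograd.

import torch
import torch.nn as nn

class MPOLinear(nn.Module):
    """Illustrative two-core MPO parameterisation of a linear layer."""

    def __init__(self, in_factors, out_factors, chi, bias=True):
        super().__init__()
        (i1, i2), (o1, o2) = in_factors, out_factors
        # Two MPO cores; contracting their shared bond index of size chi
        # reconstructs a weight of shape (o1*o2, i1*i2).
        self.core1 = nn.Parameter(torch.randn(o1, i1, chi) * 0.02)
        self.core2 = nn.Parameter(torch.randn(chi, o2, i2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(o1 * o2)) if bias else None
        self.shape = (o1 * o2, i1 * i2)

    def forward(self, x):
        # Contract the bond index: (o1,i1,chi) x (chi,o2,i2) -> (o1,o2,i1,i2),
        # then reshape to a dense weight and apply it.
        w = torch.einsum('aic,cbj->abij', self.core1, self.core2)
        return nn.functional.linear(x, w.reshape(self.shape), self.bias)

# E.g. a 64 -> 256 projection factorised as (8,8) -> (16,16) with chi = 8:
layer = MPOLinear(in_factors=(8, 8), out_factors=(16, 16), chi=8)
y = layer(torch.randn(4, 64))  # -> shape (4, 256)

For deeper chains ($L = 3$), one more core and one more bond index are added to the einsum; TT-SVD initialisation would replace the random cores with truncated factors of the pretrained dense weight.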


Paper Structure

This paper contains 33 sections, 17 equations, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Tensor network diagram of an MPO weight matrix. Horizontal lines carry virtual (bond) indices of dimension $\chi$; vertical lines are physical indices of size $d_l^{\mathrm{out}}$ (blue, upward) and $d_l^{\mathrm{in}}$ (red, downward). Contracting all bond indices reconstructs the full weight $\widehat{\mathbf{W}}\in\mathbb{R}^{\mathrm{out}\times\mathrm{in}}$.
  • Figure 2: Per-layer MPO reconstruction error versus bond dimension. Three-site decompositions ($L=3$) achieve lower error per parameter than two-site ones ($L=2$) across the range of $\chi$ considered here.
  • Figure 3: Training (left) and validation (right) cross-entropy loss as a function of training step. Higher bond dimensions converge faster and to lower loss values in the train-from-scratch setting studied here.
  • Figure 4: Left: Validation token accuracy during training for all models. Right: Pareto frontier of final validation accuracy versus parameter count. MPO $\chi=16$ retains $97.7\%$ of the dense accuracy at $5.3\times$ parameter compression.
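The monotone error-versus-$\chi$ trend in Figure 2 can be reproduced with a short script. The sketch below uses illustrative shapes and helper names, not the paper's code; for $L = 2$, TT-SVD reduces to a single truncated SVD of the weight reshaped to group the paired physical indices.

import torch

def tt_svd_2site(W, out_factors, in_factors, chi):
    (o1, o2), (i1, i2) = out_factors, in_factors
    # Group (o1,i1) into rows and (o2,i2) into columns, then truncate.
    M = W.reshape(o1, o2, i1, i2).permute(0, 2, 1, 3).reshape(o1 * i1, o2 * i2)
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)
    r = min(chi, S.numel())
    core1 = (U[:, :r] * S[:r]).reshape(o1, i1, r)  # (o1, i1, chi)
    core2 = Vh[:r].reshape(r, o2, i2)              # (chi, o2, i2)
    return core1, core2

W = torch.randn(256, 64)
for chi in (4, 8, 16, 32):
    c1, c2 = tt_svd_2site(W, (16, 16), (8, 8), chi)
    W_hat = torch.einsum('aic,cbj->abij', c1, c2).reshape(256, 64)
    err = torch.linalg.norm(W - W_hat) / torch.linalg.norm(W)
    print(f'chi={chi:2d}  relative error={err:.3f}')

By the Eckart-Young theorem, the Frobenius error at truncation rank $\chi$ equals the norm of the discarded singular values, so weights with fast-decaying spectra compress far more gracefully than the random matrix used in this toy example.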

Theorems & Definitions (2)

  • Definition 1: Tensor Train
  • Definition 2: Matrix Product Operator
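For reference, the standard statements behind these two definitions, written in the index convention of Figure 1, can be sketched as follows (the paper's exact wording and normalisation may differ):

\[
  \mathcal{T}_{j_1 j_2 \cdots j_L}
  \;=\; \sum_{\alpha_1,\dots,\alpha_{L-1}}
  G^{(1)}_{j_1,\,\alpha_1}\, G^{(2)}_{\alpha_1,\,j_2,\,\alpha_2}
  \cdots\, G^{(L)}_{\alpha_{L-1},\,j_L}
  \qquad \text{(Tensor Train)}
\]

\[
  \widehat{\mathbf{W}}_{(i_1 \cdots i_L),\,(j_1 \cdots j_L)}
  \;=\; \sum_{\alpha_1,\dots,\alpha_{L-1}}
  A^{(1)}_{i_1 j_1,\,\alpha_1}\, A^{(2)}_{\alpha_1,\,i_2 j_2,\,\alpha_2}
  \cdots\, A^{(L)}_{\alpha_{L-1},\,i_L j_L}
  \qquad \text{(MPO)}
\]

Here each bond index $\alpha_l$ runs over at most $\chi$ values, the physical indices satisfy $i_l \in \{1,\dots,d_l^{\mathrm{out}}\}$ and $j_l \in \{1,\dots,d_l^{\mathrm{in}}\}$ with $\mathrm{out} = \prod_l d_l^{\mathrm{out}}$ and $\mathrm{in} = \prod_l d_l^{\mathrm{in}}$, so contracting all bond indices reconstructs $\widehat{\mathbf{W}} \in \mathbb{R}^{\mathrm{out}\times\mathrm{in}}$, as in Figure 1.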