COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression

Denis Makhov; Dmitriy Shopkhoev; Magauiya Zhussip; Ammar Ali; Baher Mohammad; Stamatios Lefkimmiatis

TL;DR

COMPOT compresses Transformer projection matrices after training, using calibration data to estimate a sparse factorization in a whitened space. With the calibration Gram matrix factored as $G = X^T X = L L^T$, it enforces an orthogonal dictionary, $D_O^T D_O = I_k$, and a sparse code in which each column $s_{O,j}$ satisfies $\|s_{O,j}\|_0 \le s$, and it reconstructs the weights as $\widehat{W} = A S_O$ with $A = L^{-T} D_O$. A one-shot global allocation pools normalized singular values across matrices to set per-matrix ranks under a model-wide budget. The dictionary update has a closed-form Procrustes solution and the sparse coding step is analytic, so no iterative pursuit is needed. The method shows strong improvements over SVD-based and dictionary-learning baselines while remaining compatible with post-training quantization. Across language, vision-language, and audio tasks, COMPOT delivers substantial quality gains at comparable memory budgets, which makes it practical for efficient deployment of large Transformers.
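To make the two closed-form steps concrete, here is a minimal NumPy sketch of an alternating scheme of this kind. It is not the authors' implementation. It assumes $W$ has shape (d_in, d_out), that $X$ stacks calibration inputs row-wise, that $L$ is a Cholesky factor of $G$, and that a small damping term keeps $G$ positive definite. The function names, the damping, and the SVD-based initialization are illustrative choices.

```python
import numpy as np

def hard_threshold_columns(C, s):
    """Keep the s largest-magnitude entries in each column of C, zero the rest."""
    S = np.zeros_like(C)
    idx = np.argpartition(-np.abs(C), s - 1, axis=0)[:s]  # shape (s, n_cols)
    cols = np.arange(C.shape[1])
    S[idx, cols] = C[idx, cols]
    return S

def procrustes_dictionary(W_t, S):
    """Orthogonal Procrustes: argmin over D with D^T D = I of ||W_t - D S||_F."""
    U, _, Vt = np.linalg.svd(W_t @ S.T, full_matrices=False)
    return U @ Vt

def factorize_whitened(W, X, k, s, n_iters=10, eps=1e-6):
    """Hypothetical helper: returns A, S, D with W approximated by A @ S."""
    d = W.shape[0]
    G = X.T @ X
    G = G + eps * np.trace(G) / d * np.eye(d)  # damping (an assumption)
    L = np.linalg.cholesky(G)                  # G = L L^T
    W_t = L.T @ W                              # weights in the whitened space
    # Initialize the dictionary with the top-k left singular vectors.
    D = np.linalg.svd(W_t, full_matrices=False)[0][:, :k]
    for _ in range(n_iters):
        # With orthonormal columns in D, the best s-sparse code is obtained
        # by hard-thresholding the projection D^T W_t column by column.
        S = hard_threshold_columns(D.T @ W_t, s)
        D = procrustes_dictionary(W_t, S)
    S = hard_threshold_columns(D.T @ W_t, s)
    A = np.linalg.solve(L.T, D)                # A = L^{-T} D
    return A, S, D
```

Because the dictionary has orthonormal columns, the sparse coding step reduces to keeping the $s$ largest-magnitude projections per column, which is why no pursuit algorithm is needed in this sketch.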

Abstract

Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union-of-subspaces representation, but existing approaches typically depend on costly iterative dictionary and coefficient updates. We propose COMPOT (Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers), a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT employs orthogonal dictionaries that enable closed-form Procrustes updates for the dictionary and analytical single-step sparse coding for the coefficients, eliminating iterative optimization. To handle heterogeneous layer sensitivity under a global compression budget, COMPOT further introduces a one-shot dynamic allocation strategy that adaptively redistributes layer-wise compression rates. Extensive experiments across diverse architectures and tasks show that COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines, while remaining fully compatible with post-training quantization for extreme compression. Code is available at https://github.com/mts-ai/COMPOT.
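The one-shot allocation idea can be illustrated with a small sketch. This is a guess at the general shape of such a scheme, not the paper's algorithm: it assumes each matrix's singular values are normalized by the largest one, that one unit of rank for an (m, n) matrix costs m + n parameters, and that pooled values are kept greedily until the budget is exhausted. The name `allocate_ranks` and the cost model are hypothetical.

```python
import numpy as np

def allocate_ranks(weights, budget_params):
    """Pool normalized singular values across matrices and pick per-matrix ranks.

    weights: dict mapping a matrix name to a 2-D array.
    budget_params: total parameter budget across all matrices.
    """
    entries = []  # (normalized singular value, matrix name, cost per rank unit)
    for name, W in weights.items():
        sv = np.linalg.svd(W, compute_uv=False)   # sorted in descending order
        sv = sv / max(sv[0], np.finfo(float).tiny)  # normalization (assumed)
        cost = W.shape[0] + W.shape[1]              # cost model (assumed)
        entries.extend((float(v), name, cost) for v in sv)

    # Largest normalized values first, across all matrices at once.
    entries.sort(key=lambda e: -e[0])

    ranks = {name: 0 for name in weights}
    used = 0
    for _, name, cost in entries:
        if used + cost > budget_params:
            break  # single pass: stop at the first entry that does not fit
        ranks[name] += 1
        used += cost
    return ranks, used

# Example: a fast-decaying spectrum receives less rank than a flat one.
ranks, used = allocate_ranks(
    {"a": np.diag([1.0, 0.1, 0.01]), "b": np.diag([2.0, 1.8, 1.6])},
    budget_params=24,
)
```

Since singular values are sorted within each matrix, the entries kept for a given matrix always form a prefix of its spectrum, so the count per matrix can be read directly as a rank.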

Paper Structure

This paper contains 40 sections, 27 equations, 12 figures, 19 tables, 2 algorithms.

Introduction
Related Work
Overview of Transformer-based Model Compression
SVD-based Matrix Factorization for Compression
Dictionary Learning and Sparse Coding
Dynamic Allocation of Compression Ratios
Method
Problem Setup
Preliminaries: Subspace vs. Union-of-subspaces Modeling
COMPOT: Calibration-optimized Orthogonal Dictionary Factorization
Experiments
Experimental Setup
Ablation Study
Main Results
Limitations and Conclusion
...and 25 more sections

Figures (12)

Figure 1: Comparison of low-rank decomposition, dictionary learning with sparse coding, and COMPOT. Low-rank uses a rigid shared orthogonal basis $\mathbf{B}$; dictionary learning enables a flexible union-of-subspaces; COMPOT lies in between by using a union-of-orthogonal-subspaces (denoted with $_O$), which enables fast closed-form dictionary updates and a lightweight coefficient update, improving compression performance.
Figure 2: Overview of the COMPOT framework. Left: the alternating minimization process. Right: the single-shot strategy for dynamic compression ratio allocation, based on singular values of normalized projection matrices.
Figure 3: Average accuracy as a function of the number of alternating minimization steps on Llama3.2-1B at $0.2$ compression, comparing random and SVD-based dictionary initialization.
Figure 4: Results of the COMPOT allocation strategy for Llama3.2-1B.
Figure 5: Results of the COMPOT allocation strategy for Qwen3-0.6B.
...and 7 more figures
