Krony-PT: GPT2 compressed with Kronecker Products

Mohamed Ayoub Ben Ayad, Jelena Mitrovic, Michael Granitzer

TL;DR

This work tackles efficient deployment of decoder-only LLMs by compressing GPT-2 with Kronecker-product factorizations of the FFN weight matrices, producing models of 81M to 96M parameters from the original 124M. Krony-PT introduces two initialization strategies (adaptive normalization for the Van Loan decomposition and a pruning-based method), along with uniform compression across all 12 transformer blocks and weight tying of the input and output embeddings. An 81M Krony-PT variant outperforms DistilGPT2 on next-token prediction and is competitive with larger Kronecker-compressed GPT-2 models. The work demonstrates the viability of Kronecker-based factorization for efficient LLM deployment and outlines future directions such as faster Kronecker computations and improved interpretability of the factors.

Abstract

We introduce Krony-PT, a compression technique for GPT-2 based on Kronecker products. We specifically target the feed-forward weights of each transformer block, and systematically compress the feed-forward layer matrices to various degrees. We introduce a modified Van Loan decomposition to initialize new Kronecker factors, and also propose a new pruning-based initialization technique. Our method compresses the original 124M-parameter GPT-2 to various smaller models, ranging from 80M to 96M. Our 81M model variant outperforms DistilGPT2 on next-token prediction across all standard language modeling datasets, and shows competitive or comparable performance with significantly larger Kronecker-based compressions of GPT-2.
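
For readers unfamiliar with the Van Loan decomposition mentioned in the abstract, the following is a minimal sketch of the standard nearest-Kronecker-product construction (rearrange the matrix, then take a rank-1 SVD). It is not the modified, adaptively normalized variant the paper proposes. The function name `nearest_kronecker`, the toy shapes, and the even split of the singular value between the two factors are assumptions made for illustration only.

```python
import numpy as np


def nearest_kronecker(W, a_shape, b_shape):
    """Return A, B minimizing ||W - kron(A, B)||_F (standard Van Loan-Pitsianis).

    Hypothetical helper for illustration; not the paper's modified variant.
    """
    m1, n1 = a_shape
    m2, n2 = b_shape
    assert W.shape == (m1 * m2, n1 * n2), "W must have shape (m1*m2, n1*n2)"

    # Rearrange W so that each row holds one flattened (m2 x n2) block.
    # For W = kron(A, B) this rearranged matrix is exactly vec(A) vec(B)^T.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)

    # The best rank-1 approximation of R yields the best Kronecker factors.
    U, S, Vt = np.linalg.svd(R, full_matrices=False)

    # Assumption: split the leading singular value evenly between A and B.
    scale = np.sqrt(S[0])
    A = scale * U[:, 0].reshape(m1, n1)
    B = scale * Vt[0].reshape(m2, n2)
    return A, B


# Toy usage with made-up shapes (not the shapes used in the paper).
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
A, B = nearest_kronecker(W, (2, 4), (3, 2))

rel_error = np.linalg.norm(W - np.kron(A, B)) / np.linalg.norm(W)
print("parameters in W:", W.size, "parameters in A and B:", A.size + B.size)
print("relative reconstruction error:", rel_error)
```

The two factors store `m1*n1 + m2*n2` values in place of the `m1*m2*n1*n2` entries of the dense matrix, which is where the compression comes from. A random matrix is generally far from a single Kronecker product, so the reconstruction error in this toy run is expected to be large.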

Paper Structure

This paper contains 22 sections, 3 equations, 2 figures, 9 tables, 1 algorithm.

Figures (2)

  • Figure 1: An illustration of pruning. Zeroing out alternating rows in a matrix is equivalent to a Kronecker product with the vector $\bigl[1\;0\bigr]^T$.
  • Figure 2: Negative log‐likelihood scores during the first 30% of an epoch on the OpenWebText validation set.
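
The identity stated in the Figure 1 caption can be checked numerically. The sketch below only illustrates that identity and does not claim to reproduce the paper's pruning-based initialization. The helper name `prune_as_kronecker`, the choice to keep the even-indexed rows, and the toy matrix size are assumptions.

```python
import numpy as np


def prune_as_kronecker(W):
    """Express 'zero out every other row of W' as a Kronecker product.

    Hypothetical helper: returns (kept_rows, e) such that
    np.kron(kept_rows, e) equals W with its odd-indexed rows set to zero.
    """
    assert W.shape[0] % 2 == 0, "this sketch assumes an even number of rows"
    kept_rows = W[0::2, :]          # the rows that survive pruning
    e = np.array([[1.0], [0.0]])    # the column vector [1 0]^T from the caption
    return kept_rows, e


rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# Direct pruning: zero the odd-indexed rows.
W_pruned = W.copy()
W_pruned[1::2, :] = 0.0

# Same matrix built as a Kronecker product of a half-height matrix and [1 0]^T.
kept_rows, e = prune_as_kronecker(W)
W_kron = np.kron(kept_rows, e)

print("identical:", np.array_equal(W_pruned, W_kron))
```

Each row of `kept_rows` is expanded into a pair of rows: the first is the row itself (multiplied by 1) and the second is all zeros (multiplied by 0), which is the alternating-row pattern the caption describes.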