
TQCompressor: improving tensor decomposition methods in neural networks via permutations

V. Abronin, A. Naumov, D. Mazur, D. Bystrov, K. Tsarova, Ar. Melnikov, I. Oseledets, S. Dolgov, R. Brasher, M. Perelshtein

TL;DR

The paper addresses the challenge of deploying large pre-trained language models in resource-constrained environments by introducing TQCompressor, a permutation-enhanced Kronecker decomposition approach that mitigates expressivity loss during compression. It combines row/column permutations with Kronecker factorization to compress the embedding, multi-head attention (MHA), and feed-forward network (FFN) components, followed by an iterative knowledge-distillation-based training regime. On GPT-2 small, the method yields TQCompressedGPT-2 with 81M parameters, trained using only about 3.1% of the OpenWebText data, and it outperforms DistilGPT-2 and KnGPT-2 in perplexity on standard causal language modeling (CLM) benchmarks. The approach demonstrates a practical path to efficient and scalable deployment of language models and suggests potential applicability to other neural architectures in resource-limited settings.

Abstract

We introduce TQCompressor, a novel method for neural network model compression with improved tensor decompositions. We explore the challenges posed by the computational and storage demands of pre-trained language models in NLP tasks and propose a permutation-based enhancement to Kronecker decomposition. This enhancement reduces the loss in model expressivity that is usually associated with factorization. We demonstrate the method on GPT-2$_{small}$. The result of the compression is the TQCompressedGPT-2 model, featuring 81 million parameters compared to 124 million in GPT-2$_{small}$. We make TQCompressedGPT-2 publicly available. We further enhance the performance of TQCompressedGPT-2 through a training strategy involving multi-step knowledge distillation, using only 3.1% of the OpenWebText dataset. TQCompressedGPT-2 surpasses DistilGPT-2 and KnGPT-2 in comparative evaluations, marking an advancement in the efficient and effective deployment of models in resource-constrained environments.
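To make the idea of a permutation-enhanced Kronecker decomposition concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard rearrangement-plus-SVD construction of the nearest Kronecker product, and it assumes the helpful row and column permutations are already known (here they are simply the inverse of a scrambling applied to a toy matrix). How TQCompressor actually searches for permutations is not described in this section. The names `nearest_kronecker` and `rel_error` and all matrix sizes are hypothetical.

```python
import numpy as np


def nearest_kronecker(W, shape_a, shape_b):
    """Best Frobenius-norm approximation W ~ kron(A, B) via rearrangement + SVD."""
    m1, n1 = shape_a
    m2, n2 = shape_b
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so that every (m2 x n2) block becomes one row of R.
    # If W is exactly kron(A, B), then R is the rank-1 matrix vec(A) vec(B)^T.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # The leading singular triplet gives the best rank-1 fit of R.
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B


def rel_error(W, A, B):
    """Relative Frobenius error of the Kronecker approximation."""
    return np.linalg.norm(W - np.kron(A, B)) / np.linalg.norm(W)


# Toy setup (assumption): a matrix with hidden Kronecker structure whose
# rows and columns have been shuffled.
rng = np.random.default_rng(0)
shape_a, shape_b = (4, 3), (2, 5)
W_structured = np.kron(rng.standard_normal(shape_a), rng.standard_normal(shape_b))
row_perm = rng.permutation(W_structured.shape[0])
col_perm = rng.permutation(W_structured.shape[1])
W = W_structured[row_perm][:, col_perm]

# Factorizing the shuffled matrix directly.
A, B = nearest_kronecker(W, shape_a, shape_b)
err_plain = rel_error(W, A, B)

# Permuting rows and columns first (here: undoing the shuffle), then factorizing.
W_permuted = W[np.argsort(row_perm)][:, np.argsort(col_perm)]
A_p, B_p = nearest_kronecker(W_permuted, shape_a, shape_b)
err_perm = rel_error(W_permuted, A_p, B_p)

# Parameter counts: the two factors versus the dense matrix.
factor_params = A_p.size + B_p.size
dense_params = W.size

print("relative error without permutation:", err_plain)
print("relative error with permutation:   ", err_perm)
print("parameters (factors vs dense):", factor_params, "vs", dense_params)
```

In this toy case the permuted matrix is exactly a Kronecker product, so the error after permutation is at floating-point level, while the shuffled matrix is fit noticeably worse by the same number of parameters. Real weight matrices have no exact hidden structure, so a permutation can only be expected to reduce the approximation error, not eliminate it.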

Paper Structure (15 sections, 17 equations, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: This figure illustrates the compression pipeline of a pre-trained GPT-2$_{small}$ model using our decomposition algorithm. The process begins with the original, uncompressed model on the left. The compression algorithm is applied in the central part of the diagram and consists of three main steps. First, a row-column permutation of the weight matrices is performed to make them better suited to decomposition. Second, Kronecker decomposition is applied to the permuted weight matrices. Third, the algorithm repeats these steps over multiple iterations until the desired level of approximation accuracy is achieved. The outcome of this process is the compressed model, TQCompressedGPT-2, shown on the right side of the diagram. This compressed model retains essential performance characteristics while reducing the overall number of parameters, making it more efficient to deploy in resource-constrained environments.
  • Figure 2: Knowledge distillation process for compressing the GPT-2$_{small}$ model into the TQCompressedGPT-2 variant. A distilled subset of the OpenWebText corpus serves as the training data for the TQCompressedGPT-2, the student model, which learns under the guidance of the full-sized GPT-2$_{small}$, the teacher model. This process is designed to preserve high performance in the compressed model by closely matching the teacher's accuracy.
  • Figure 3: Only specific layers are compressed: the embedding layer (E), the feed-forward network (FFN), and the multi-head attention layer (MHA). Compression applies row-permutation (P) and column-permutation (C) matrices to the original weight matrices, followed by Kronecker decomposition, represented by matrices A and B. The classifier outputs from both the original and the compressed models are then used in a knowledge distillation framework, where the original GPT-2$_{small}$ model serves as the teacher and TQCompressedGPT-2 acts as the student. The distillation process focuses on aligning the classifier outputs, thereby preserving the performance of the compressed model relative to its original counterpart.
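
The captions for Figures 2 and 3 describe distillation as aligning the classifier outputs of the student with those of the teacher, but they do not state the loss function. The sketch below assumes one common choice, a temperature-scaled KL divergence between the teacher's and student's next-token distributions. The function names, the temperature value, and the tensor shapes are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over all token positions.

    Both inputs have shape (batch, sequence, vocab). Logits are divided by
    the temperature before the softmax, and the result is scaled by
    temperature**2, a common convention for soft-target losses (assumed here).
    """
    teacher_logp = log_softmax(teacher_logits / temperature)
    student_logp = log_softmax(student_logits / temperature)
    kl = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)


# Toy classifier outputs (assumption): 2 sequences, 4 positions, vocabulary of 10.
rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((2, 4, 10))
student_logits = teacher_logits + 0.5 * rng.standard_normal((2, 4, 10))

loss_mismatch = distillation_loss(student_logits, teacher_logits)
loss_match = distillation_loss(teacher_logits, teacher_logits)

print("loss when student differs from teacher:", loss_mismatch)
print("loss when student equals teacher:      ", loss_match)
```

The loss is zero when the student reproduces the teacher's outputs exactly and positive otherwise, which is the property a training loop would exploit when updating only the student's (compressed) weights.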