TQCompressor: improving tensor decomposition methods in neural networks via permutations
V. Abronin, A. Naumov, D. Mazur, D. Bystrov, K. Tsarova, Ar. Melnikov, I. Oseledets, S. Dolgov, R. Brasher, M. Perelshtein
TL;DR
The paper addresses the challenge of deploying large pre-trained language models in resource-constrained environments by introducing TQCompressor, a permutation-enhanced Kronecker decomposition approach that mitigates the expressivity loss incurred during compression. It combines row/column permutations with Kronecker factorization to compress the embedding, multi-head attention (MHA), and feed-forward network (FFN) components, followed by an iterative knowledge-distillation-based training regime. Applied to GPT-2 small, the method yields TQCompressedGPT-2 with 81M parameters, trained on only about 3.1% of the OpenWebText data, and it outperforms DistilGPT-2 and KnGPT-2 in perplexity on standard causal language modeling (CLM) benchmarks. The approach demonstrates a practical path to efficient and scalable deployment of language models and suggests potential applicability to other neural architectures in resource-limited settings.
Abstract
We introduce TQCompressor, a novel method for neural network model compression with improved tensor decompositions. We explore the challenges posed by the computational and storage demands of pre-trained language models in NLP tasks and propose a permutation-based enhancement to Kronecker decomposition. This enhancement makes it possible to reduce the loss in model expressivity that is usually associated with factorization. We demonstrate the method on GPT-2$_{small}$. The result of the compression is the TQCompressedGPT-2 model, with 81 million parameters compared to 124 million in GPT-2$_{small}$. We make TQCompressedGPT-2 publicly available. We further improve the performance of TQCompressedGPT-2 through a training strategy based on multi-step knowledge distillation, using only 3.1% of the OpenWebText dataset. TQCompressedGPT-2 surpasses DistilGPT-2 and KnGPT-2 in comparative evaluations, marking an advancement in the efficient and effective deployment of models in resource-constrained environments.
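To illustrate why permutations can matter for Kronecker factorization, the following is a minimal NumPy sketch, not the authors' implementation. It fits a single Kronecker product to a small matrix using the standard rearrangement-plus-SVD construction (Van Loan and Pitsianis), once directly and once after undoing a row permutation. The helper names `nearest_kronecker` and `rel_error`, the matrix sizes, and the permutation are all hypothetical; in particular, the permutation here is known by construction, whereas how TQCompressor actually selects its permutations is not described in this section.

```python
import numpy as np

def nearest_kronecker(W, a_shape, b_shape):
    """Best Frobenius-norm fit W ~ kron(A, B) via rearrangement + rank-1 SVD."""
    m1, n1 = a_shape
    m2, n2 = b_shape
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so that each row holds one (m2 x n2) block, flattened.
    # If W is an exact Kronecker product, this matrix has rank 1.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

def rel_error(W, W_hat):
    """Relative Frobenius-norm reconstruction error."""
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)

rng = np.random.default_rng(0)
A_true = rng.standard_normal((4, 3))
B_true = rng.standard_normal((2, 5))

# Hypothetical row permutation (swaps rows 1 and 2), chosen for illustration.
perm = np.array([0, 2, 1, 3, 4, 5, 6, 7])
W = np.kron(A_true, B_true)[perm]  # Kronecker structure hidden by the shuffle

# Plain Kronecker fit on the shuffled matrix.
A0, B0 = nearest_kronecker(W, (4, 3), (2, 5))
err_plain = rel_error(W, np.kron(A0, B0))

# Undo the permutation, fit, then re-apply it to the reconstruction.
inv = np.argsort(perm)
A1, B1 = nearest_kronecker(W[inv], (4, 3), (2, 5))
err_perm = rel_error(W, np.kron(A1, B1)[perm])

print(f"plain Kronecker fit:    {err_plain:.3e}")
print(f"with row permutation:   {err_perm:.3e}")
```

In this toy setting the two factors hold 4·3 + 2·5 = 22 numbers instead of the 8·15 = 120 entries of `W`, and the permuted fit recovers `W` up to floating-point error while the plain fit does not. Real weight matrices are not exact permuted Kronecker products, so the gain from a permutation would be a reduction in approximation error rather than exact recovery.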
