On the Effectiveness of Low-Rank Matrix Factorization for LSTM Model Compression
Genta Indra Winata, Andrea Madotto, Jamin Shin, Elham J. Barezi, Pascale Fung
TL;DR
The paper tackles the inefficiency of LSTMs in NLP by applying post-training low-rank matrix factorization to compress LSTM gate weights, comparing Truncated SVD and Semi-NMF against pruning. It finds that compressing the multiplicative recurrence ($W_h$) generally yields better performance and efficiency than compressing the additive recurrence ($W_i$), with strong correlations between weight norms and compression outcomes. Across language modeling on PTB and WikiText-2 and downstream tasks with pre-trained ELMo, the approach achieves about 1.5x faster inference with minimal performance loss, and up to ~2x with fine-tuning, while revealing that $W_h$ is inherently more low-rank than $W_i$. The norm analysis clarifies when MF or pruning is preferable and explains the differing outcomes between $W_i$ and $W_h$, guiding practical compression decisions for NLP models.
Abstract
Despite their ubiquity in NLP tasks, Long Short-Term Memory (LSTM) networks suffer from computational inefficiencies caused by inherently unparallelizable recurrences, a problem that worsens as LSTMs require more parameters for larger memory capacity. In this paper, we propose to apply low-rank matrix factorization (MF) algorithms to different recurrences in LSTMs, and explore their effectiveness on different NLP tasks and model components. We discover that additive recurrence is more important than multiplicative recurrence, and explain this by identifying meaningful correlations between matrix norms and compression performance. We compare our approach across two settings: 1) compressing core LSTM recurrences in language models, and 2) compressing biLSTM layers of ELMo evaluated in three downstream NLP tasks.
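
To make the idea of post-training low-rank factorization concrete, the following is a minimal NumPy sketch of Truncated SVD applied to a single recurrent weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the helper name `truncated_svd_factorize`, the stacked-gate shape `(4 * hidden, hidden)`, the synthetic near-low-rank matrix, and the chosen rank are all hypothetical.

```python
import numpy as np


def truncated_svd_factorize(W, rank):
    """Return (U_r, V_r) such that W is approximated by U_r @ V_r.

    U_r has shape (m, rank) and V_r has shape (rank, n).
    The helper name is hypothetical, chosen for this sketch.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]  # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r


rng = np.random.default_rng(0)
hidden = 64
true_rank = 8

# Assumption: the four LSTM gates are stacked into one (4*hidden, hidden) matrix.
# The matrix is synthetic: rank-8 structure plus small noise, standing in for W_h.
W_h = (rng.standard_normal((4 * hidden, true_rank))
       @ rng.standard_normal((true_rank, hidden))
       + 0.01 * rng.standard_normal((4 * hidden, hidden)))

rank = 8
U_r, V_r = truncated_svd_factorize(W_h, rank)

# One recurrent step: replace W_h @ h with two thinner products.
h = rng.standard_normal(hidden)
full = W_h @ h
approx = U_r @ (V_r @ h)
rel_err = np.linalg.norm(full - approx) / np.linalg.norm(full)

params_full = W_h.size                  # m * n
params_lowrank = U_r.size + V_r.size    # rank * (m + n)
print(f"relative error: {rel_err:.4f}")
print(f"parameters: {params_full} -> {params_lowrank}")
```

The factorized form stores `rank * (m + n)` values instead of `m * n`, so it only saves parameters and multiply-adds when the rank is well below `m * n / (m + n)`. How much accuracy survives depends on how close the real weight matrix is to low rank, which is the property the paper examines for $W_i$ and $W_h$.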
