On the Effectiveness of Low-Rank Matrix Factorization for LSTM Model Compression

Genta Indra Winata, Andrea Madotto, Jamin Shin, Elham J. Barezi, Pascale Fung

TL;DR

The paper tackles the inefficiency of LSTMs in NLP by applying post-training low-rank matrix factorization to compress LSTM gate weights, comparing Truncated SVD and Semi-NMF against pruning. It finds that compressing the multiplicative recurrence ($W_h$) generally yields better performance and efficiency than compressing the additive recurrence ($W_i$), with strong correlations between weight norms and compression outcomes. Across language modeling on PTB and WikiText-2 and downstream tasks with pre-trained ELMo, the approach achieves about 1.5x faster inference with minimal performance loss, and up to ~2x with fine-tuning, while revealing that $W_h$ is inherently more low-rank than $W_i$. The norm analysis clarifies when MF or pruning is preferable and explains the differing outcomes between $W_i$ and $W_h$, guiding practical compression decisions for NLP models.

Abstract

Despite their ubiquity in NLP tasks, Long Short-Term Memory (LSTM) networks suffer from computational inefficiencies caused by inherently unparallelizable recurrences, a problem that worsens as LSTMs require more parameters for larger memory capacity. In this paper, we propose to apply low-rank matrix factorization (MF) algorithms to different recurrences in LSTMs, and explore their effectiveness on different NLP tasks and model components. We discover that additive recurrence is more important than multiplicative recurrence, and explain this by identifying meaningful correlations between matrix norms and compression performance. We compare our approach across two settings: 1) compressing core LSTM recurrences in language models, and 2) compressing biLSTM layers of ELMo evaluated in three downstream NLP tasks.
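
To make the idea of post-training low-rank factorization concrete, the following is a minimal sketch of Truncated SVD applied to a single weight matrix. It is not the authors' code: the matrix `W_h`, its shape (four gates stacked along the columns), the hidden size, and the chosen rank are all illustrative assumptions, and the matrix is synthetic (low-rank plus small noise) rather than taken from a trained LSTM.

```python
import numpy as np

def truncated_svd_factorize(W, rank):
    """Split W (m x n) into A (m x rank) and B (rank x n).

    A @ B is the best rank-`rank` approximation of W in Frobenius norm.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # scale each kept column by its singular value
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
hidden, rank = 64, 8  # hypothetical sizes, chosen only for illustration

# Synthetic stand-in for a recurrent weight matrix: approximately rank 8.
W_h = rng.standard_normal((hidden, rank)) @ rng.standard_normal((rank, 4 * hidden))
W_h += 0.01 * rng.standard_normal((hidden, 4 * hidden))

A, B = truncated_svd_factorize(W_h, rank)

params_before = W_h.size          # m * n
params_after = A.size + B.size    # rank * (m + n)
rel_error = np.linalg.norm(W_h - A @ B) / np.linalg.norm(W_h)

# At inference time the dense product is replaced by two thin products.
h = rng.standard_normal(hidden)
dense_out = h @ W_h
lowrank_out = (h @ A) @ B

print(f"parameters: {params_before} -> {params_after}")
print(f"relative reconstruction error: {rel_error:.4f}")
```

The factorized form stores `rank * (m + n)` values instead of `m * n`, so it only saves parameters and compute when the rank is well below `m * n / (m + n)`. How much accuracy survives depends on how close the trained matrix is to low rank, which is the property the paper's norm analysis examines.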

Paper Structure

This paper contains 17 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Factorized LSTM Cell
  • Figure 2: Norm analysis comparisons between MF and Pruning in Language Modeling (PTB) and ELMo. Rank versus (a) $\sigma(\|\mathbf{W}_i\|_1)$ (b) $\sigma(\|\mathbf{W}_h\|_1)$ (c) $\|\mathbf{W}_i\|_{1}$ (d) $\|\mathbf{W}_h\|_{1}$ (e) $\|\mathbf{W}_i\|_{Nuc}$ (f) $\|\mathbf{W}_h\|_{Nuc}$.
  • Figure 3: Heatmap of LSTM weights on PTB.
  • Figure 4: Heatmap of ELMo forward weights.
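
The norm comparison named in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions, not a reproduction of the paper's analysis: the matrix is random rather than trained, $\|\cdot\|_1$ is taken to be the entrywise L1 norm (the paper's exact definition may differ), pruning is plain magnitude pruning, and the helper names are hypothetical. The pruned matrix keeps as many nonzero entries as the rank-`rank` factorization has parameters, so the two are compared at a matched budget.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Best rank-`rank` approximation of W via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

def magnitude_prune(W, keep):
    """Zero out all but the `keep` largest-magnitude entries of W."""
    flat = np.abs(W).ravel()
    kth = flat.size - keep
    threshold = np.partition(flat, kth)[kth]  # keep-th largest magnitude
    return np.where(np.abs(W) >= threshold, W, 0.0)

def entrywise_l1(W):
    """Sum of absolute values of all entries (assumed meaning of ||W||_1)."""
    return np.abs(W).sum()

def nuclear_norm(W):
    """Sum of singular values of W."""
    return np.linalg.norm(W, ord="nuc")

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 256))          # hypothetical weight matrix
rank = 8
budget = rank * (W.shape[0] + W.shape[1])   # parameters in a rank-8 factorization

W_mf = low_rank_approx(W, rank)
W_pruned = magnitude_prune(W, budget)

for name, M in [("original", W), ("MF", W_mf), ("pruned", W_pruned)]:
    print(f"{name:>8}: L1 = {entrywise_l1(M):10.2f}, nuclear = {nuclear_norm(M):10.2f}")
```

Repeating this over a range of ranks, and over the actual $\mathbf{W}_i$ and $\mathbf{W}_h$ of a trained model, would give curves of the kind the caption describes.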