HadamRNN: Binary and Sparse Ternary Orthogonal RNNs

Armand Foucault, Franck Mamalet, François Malgouyres

TL;DR

This work tackles the challenge of binarizing recurrent weights in vanilla/orthogonal RNNs by introducing HadamRNN, a binary orthogonal recurrent network parameterized via Hadamard matrices, and its sparse ternary extension, Block-HadamRNN. By constructing $W(u)=\frac{1}{\sqrt{d_h}}\mathrm{diag}(u)\mathrm{S}_{2^{k}}$ (and a block Kronecker variant for sparsity), the authors preserve orthogonality ($W^{\top}W=I$) while enabling training with the straight-through estimator (STE) and quantized inputs/outputs. The models achieve strong performance on long-horizon tasks (e.g., the copy task with 1000 timesteps) and competitive results on MNIST, IMDB, GLUE, and IoT benchmarks. The paper covers the relevant Hadamard theory, matrix-quantization strategies, and reductions in model size and compute, along with ablations demonstrating the benefits of linear recurrent units and the trade-offs controlled by the sparsity parameter $q$ in Block-HadamRNN. Overall, the results show that highly quantized, edge-friendly RNNs can rival full-precision orthogonal models on a range of sequential and NLP tasks while requiring far less memory and computation. This advances the practical deployment of memory-efficient, long-range sequence models on resource-constrained devices.
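The orthogonality claim can be checked in a few lines. The sketch below rests on two assumptions that the summary does not state explicitly: that $d_h = 2^k$, and that $u \in \{-1,+1\}^{d_h}$. It also uses the defining property of a Hadamard matrix, $\mathrm{S}_{2^{k}}^{\top}\mathrm{S}_{2^{k}} = 2^{k} I$.

```latex
\begin{align*}
W(u)^{\top} W(u)
  &= \frac{1}{d_h}\, \mathrm{S}_{2^{k}}^{\top}\, \mathrm{diag}(u)^{\top} \mathrm{diag}(u)\, \mathrm{S}_{2^{k}} \\
  &= \frac{1}{d_h}\, \mathrm{S}_{2^{k}}^{\top}\, \mathrm{diag}(u_1^2, \dots, u_{d_h}^2)\, \mathrm{S}_{2^{k}} \\
  &= \frac{1}{d_h}\, \mathrm{S}_{2^{k}}^{\top} \mathrm{S}_{2^{k}}
     && \text{since } u_i^2 = 1 \\
  &= \frac{2^{k}}{d_h}\, I = I
     && \text{since } d_h = 2^{k}.
\end{align*}
```

Under the same assumptions, every entry of $\sqrt{d_h}\,W(u)$ is $\pm 1$, so $W(u)$ is binary up to a single global scale factor, and only the sign vector $u$ has to be learned.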

Abstract

Binary and sparse ternary weights in neural networks enable faster computations and lighter representations, facilitating their use on edge devices with limited computational power. Meanwhile, vanilla RNNs are highly sensitive to changes in their recurrent weights, making the binarization and ternarization of these weights inherently challenging. To date, no method has successfully achieved binarization or ternarization of vanilla RNN weights. We present a new approach leveraging the properties of Hadamard matrices to parameterize a subset of binary and sparse ternary orthogonal matrices. This method enables the training of orthogonal RNNs (ORNNs) with binary and sparse ternary recurrent weights, effectively creating a specific class of binary and sparse ternary vanilla RNNs. The resulting ORNNs, called HadamRNN and Block-HadamRNN, are evaluated on benchmarks such as the copy task, permuted and sequential MNIST tasks, the IMDB dataset, two GLUE benchmarks, and two IoT benchmarks. Despite binarization or sparse ternarization, these RNNs maintain performance levels comparable to state-of-the-art full-precision models, highlighting the effectiveness of our approach. Notably, our approach is the first solution with binary recurrent weights capable of tackling the copy task over 1000 timesteps.

Paper Structure

This paper contains 57 sections, 4 theorems, 35 equations, 3 figures, 10 tables.

Key Result

Proposition 3.2

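Let $k \geq 1$. The $2^k \times 2^k$ matrix $\mathbf{S}_{2^{k}}$, defined recursively by

$$\mathbf{S}_{2} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad \mathbf{S}_{2^{k}} = \begin{pmatrix} \mathbf{S}_{2^{k-1}} & \mathbf{S}_{2^{k-1}} \\ \mathbf{S}_{2^{k-1}} & -\mathbf{S}_{2^{k-1}} \end{pmatrix} \quad \text{for } k \geq 2,$$

is a Hadamard matrix. It is called the Sylvester matrix of size $2^k$ (Horadam, 2007). These matrices are also called Walsh matrices in some contexts.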
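As a quick sanity check of the proposition, the following minimal sketch builds a Sylvester matrix with the standard doubling recursion (start from the $2 \times 2$ matrix with rows $(1, 1)$ and $(1, -1)$, then repeatedly form the block matrix with blocks $S, S, S, -S$) and verifies the Hadamard property numerically. The function names `sylvester` and `is_hadamard` are hypothetical helpers chosen for illustration and are not taken from the paper. The code uses plain Python lists so that it runs with the standard library alone.

```python
def sylvester(k):
    """Return the 2^k x 2^k Sylvester matrix as a list of rows with entries +1/-1."""
    if k < 1:
        raise ValueError("k must be >= 1")
    s = [[1, 1], [1, -1]]  # base case: S_2
    for _ in range(k - 1):
        # Doubling step: [[S, S], [S, -S]]
        top = [row + row for row in s]
        bottom = [row + [-x for x in row] for row in s]
        s = top + bottom
    return s


def is_hadamard(h):
    """Check that h is square, has entries in {-1, +1}, and satisfies H H^T = n I."""
    n = len(h)
    if any(len(row) != n for row in h):
        return False
    if any(x not in (1, -1) for row in h for x in row):
        return False
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(h[i], h[j]))
            if dot != (n if i == j else 0):
                return False
    return True


if __name__ == "__main__":
    s8 = sylvester(3)
    print(len(s8), is_hadamard(s8))
```

The check is exact because all arithmetic stays in integers. For larger sizes, a vectorized array library would be a more practical choice than nested lists.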

Figures (3)

  • Figure 1: Position of each model in the (size, performance) plane, on pMNIST. The most effective models are located in the upper-left corner of the figure. The parameter $p$ corresponds to the bitwidth of the quantized matrices $U$ and $V$, as introduced in the paper's section on quantizing $U$ and $V$. FP stands for full-precision.
  • Figure 2: Position of HadamRNN and Block-HadamRNN models for different values of $q$ in the (complexity, performance) plane. The most effective models appear in the lower-left corner of the figure. The bitwidth of the quantized matrices $U$ and $V$ is set to $p=4$.
  • Figure 3: Top: ${\mathbf e}^{lin}$ and ${\mathbf f}^{lin}$; bottom: ${\mathbf e}^{ReLU}$ and ${\mathbf f}^{ReLU}$ (see their definitions in the paper).

Theorems & Definitions (8)

  • Definition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Definition B.1
  • Proposition B.2
  • Proof
  • Proposition H.1
  • Proof