Tensorization is a powerful but underexplored tool for compression and interpretability of neural networks

Safa Hamreras, Sukhbinder Singh, Román Orús

TL;DR

The paper argues that scaling neural networks demands compression techniques that preserve performance and enhance interpretability. It advocates Tensorized Neural Networks (TNNs), which reshape dense weight matrices $W$ into higher-order tensors and approximate them with low-rank tensor networks (TNs) such as Matrix Product Operators (MPO/TT) and Tucker/CP decompositions, enabling parameter efficiency and revealing latent bond indices. It highlights the stack view as a flexible design that yields multiple equivalent computation paths, offers forward/backward pass acceleration under certain contractions, and opens new interpretability avenues via bond features and tensorized autoencoders. Finally, it outlines practical challenges—hardware/software support, inductive-bias characterization, hyperparameter complexity, and integration with other compression methods—and sketches a roadmap toward fully tensorized networks with tensorized activations and nonlinearities to advance scalable, trustworthy AI.

Abstract

Tensorizing a neural network involves reshaping some or all of its dense weight matrices into higher-order tensors and approximating them using low-rank tensor network decompositions. This technique has shown promise as a model compression strategy for large-scale neural networks. However, despite encouraging empirical results, tensorized neural networks (TNNs) remain underutilized in mainstream deep learning. In this position paper, we offer a perspective on both the potential and current limitations of TNNs. We argue that TNNs represent a powerful yet underexplored framework for deep learning--one that deserves greater attention from both engineering and theoretical communities. Beyond compression, we highlight the value of TNNs as a flexible class of architectures with distinctive scaling properties and increased interpretability. A central feature of TNNs is the presence of bond indices, which introduce new latent spaces not found in conventional networks. These internal representations may provide deeper insight into the evolution of features across layers, potentially advancing the goals of mechanistic interpretability. We conclude by outlining several key research directions aimed at overcoming the practical barriers to scaling and adopting TNNs in modern deep learning workflows.
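
The reshape-then-decompose step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a matrix product operator (tensor train) layout with one input and one output index per core, built by repeated truncated SVDs; the helper names `mpo_decompose` and `mpo_to_matrix`, the factor sizes, and the bond cap `max_bond` are illustrative choices, not taken from the paper.

```python
import numpy as np

def mpo_decompose(W, in_dims, out_dims, max_bond):
    """Split W (prod(in_dims) x prod(out_dims)) into MPO cores by repeated
    truncated SVDs. Core j has shape (left_bond, in_dims[j], out_dims[j], right_bond)."""
    k = len(in_dims)
    assert len(out_dims) == k
    # Reshape to (i1..ik, o1..ok), then interleave the axes to (i1, o1, i2, o2, ...).
    T = W.reshape(*in_dims, *out_dims)
    T = T.transpose([ax for j in range(k) for ax in (j, k + j)])
    cores, left = [], 1
    for j in range(k - 1):
        # Group (left bond, i_j, o_j) as rows and everything else as columns.
        T = T.reshape(left * in_dims[j] * out_dims[j], -1)
        U, S, Vh = np.linalg.svd(T, full_matrices=False)
        r = min(max_bond, len(S))            # truncate the bond index
        cores.append(U[:, :r].reshape(left, in_dims[j], out_dims[j], r))
        T = S[:r, None] * Vh[:r]             # carry the remainder to the next core
        left = r
    cores.append(T.reshape(left, in_dims[-1], out_dims[-1], 1))
    return cores

def mpo_to_matrix(cores):
    """Contract the MPO cores back into a dense matrix."""
    T = cores[0]
    for c in cores[1:]:
        T = np.tensordot(T, c, axes=1)       # sum over the shared bond index
    T = T.reshape(T.shape[1:-1])             # drop the trivial outer bonds
    k = len(cores)
    # Undo the interleaving: (i1, o1, i2, o2, ...) -> (i1..ik, o1..ok).
    T = T.transpose([2 * j for j in range(k)] + [2 * j + 1 for j in range(k)])
    m = int(np.prod([c.shape[1] for c in cores]))
    n = int(np.prod([c.shape[2] for c in cores]))
    return T.reshape(m, n)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))                # stand-in for a pretrained weight
cores = mpo_decompose(W, (4, 4, 4), (4, 4, 4), max_bond=8)
n_params = sum(c.size for c in cores)        # 1280 parameters versus 4096 in W
rel_err = np.linalg.norm(W - mpo_to_matrix(cores)) / np.linalg.norm(W)
```

A random matrix like the stand-in above has no low-rank structure, so `rel_err` is large here; the sketch shows only the mechanics and the parameter count, not the accuracy one might see on trained weights.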

Paper Structure

This paper contains 5 sections, 6 figures.

Figures (6)

  • Figure 1: (i) An MLP with three fully-connected layers $A, B, C$, depicted as rectangles, and two non-linear layers (here, ReLU and Tanh activation layers), depicted as wiggly lines. The network is tensorized as follows. (ii) First, each weight matrix ($A, B$ and $C$) is reshaped into a higher-dimensional tensor. For example, the $m \times n$ matrix $B$ can be reshaped into an 8-index tensor with 3 input indices $m_1, m_2, m_3$ such that $m_1 m_2 m_3 = m$ and 5 output indices $n_1, n_2, n_3, n_4, n_5$ such that $n_1 n_2 n_3 n_4 n_5 = n$. (iii) Then each such tensor is decomposed as a tensor network, i.e., a contraction (einsum()) of tensors. Three examples of tensor network layers are shown: (a) a matrix product operator (or tensor train) layer, (b) a generic tensor network layer and (c) a tensor ring layer. A tensor network decomposition exposes new degrees of freedom inside the neural network, carried by the bond indices of the network, highlighted in red.
  • Figure 2: (i) A matrix $m_{ij}$ as a 2-index tensor. (ii) Notation for the identity matrix. (iii) Tensor product $c = a \otimes b$ shown by horizontal stacking. (iv) Matrix product $c = ab$ shown by vertical stacking with contraction over shared (red) index. (v) A 3-index tensor $n_{ijk}$. (vi) A 3-index Kronecker delta $\delta_{ijk} = \delta_{ij} \delta_{jk}$. (vii) A general $n$-index tensor $T$. (viii) A tensor network formed by contracting (via e.g. einsum) tensors $P, Q, R$; summed (bond) indices in red.
  • Figure 3: Examples of tensor network layers that we consider in this paper. (i-ii) An MPO (or tensor train) layer obtained by reshaping a pretrained weight $W$ as a higher-order tensor and then decomposing it as an MPO via repeated singular value decompositions. (iii) An MPO with all bond dimensions equal to 1 (no red line) corresponds to a hypercube of neurons that are completely uncorrelated across different hypercube dimensions. (iv) A Tucker decomposition of a 2D convolution kernel $K_{xywh}$, where indices $x, y$ and $w, h$ refer to the spatial location of the top-left corner of an image patch and the patch’s width and height, respectively. The bond dimensions on the right correspond to the Tucker ranks of the kernel. (v) The Canonical Polyadic decomposition of $K$; this is a special case of the Tucker decomposition where the core tensor $G$ is equal to a delta tensor (Fig. 2(vi)).
  • Figure 4: (i) An MPO layer with 4 tensors. (ii) A "stack" view of the same MPO layer as a composition of 4 standard fully-connected layers $\mathcal{A}, \mathcal{B}, \mathcal{C},$ and $\mathcal{D}$. These layers are sparse, each equal to the tensor product of a small matrix, obtained by matricizing an MPO tensor, and identity matrices, one for each empty wire [Fig. 2(ii)] passing through the layer. In this view, the bond indices of the MPO are simply the input and output dimensions of standard linear layers. (iii) Another distinct stack view of the same MPO layer, obtained by vertically shifting the MPO tensors, corresponding to distinct matricizations of the tensors, and transforming the tensors $a, b$ with an invertible matrix $X$ as $a \rightarrow aX^{-1}$ and $b \rightarrow Xb$. The stack composed of fully-connected layers $\mathcal{A}, \mathcal{B}', \mathcal{C}',$ and $\mathcal{D}'$ is also equal to the original MPO. (iv) A stack view of a more general tensor network fully-connected layer.
  • Figure 5: Forward pass of an MPO layer
  • ...and 1 more figure
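
The stack view in the Figure 4 caption can be checked numerically on a small case. The sketch below assumes an MPO with only two cores (the figure uses four), a row-vector convention $y = xW$, and arbitrary index sizes; the helper names `dense_from_cores` and `stack_layers` are hypothetical and chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
i1, i2, o1, o2, r = 3, 4, 2, 5, 6    # illustrative sizes, not from the paper

# Two MPO cores joined by one bond index of dimension r
# (the trivial outer bonds are omitted for brevity).
a = rng.normal(size=(i1, o1, r))
b = rng.normal(size=(r, i2, o2))

def dense_from_cores(a, b):
    # W[(i1,i2),(o1,o2)] = sum_r a[i1,o1,r] * b[r,i2,o2]
    W = np.einsum("ipr,rjq->ijpq", a, b)
    return W.reshape(a.shape[0] * b.shape[1], a.shape[1] * b.shape[2])

def stack_layers(a, b):
    i1, o1, r = a.shape
    _, i2, o2 = b.shape
    # First layer: matricized core a acting on wire i1, identity on wire i2.
    layer_A = np.kron(a.reshape(i1, o1 * r), np.eye(i2))
    # Second layer: identity on wire o1, matricized core b acting on (r, i2).
    layer_B = np.kron(np.eye(o1), b.reshape(r * i2, o2))
    return layer_A, layer_B

W = dense_from_cores(a, b)
layer_A, layer_B = stack_layers(a, b)
stack_matches = np.allclose(layer_A @ layer_B, W)

# Gauge freedom on the bond: a -> a X^{-1}, b -> X b leaves the layer unchanged.
X = rng.normal(size=(r, r)) + r * np.eye(r)   # shifted to keep X well conditioned
a_gauged = a @ np.linalg.inv(X)
b_gauged = np.tensordot(X, b, axes=1)
gauge_matches = np.allclose(dense_from_cores(a_gauged, b_gauged), W)

# The hidden activation between the two sparse layers carries the bond index.
x = rng.normal(size=(1, i1 * i2))
hidden = x @ layer_A                          # shape (1, o1 * r * i2)
```

In this two-core case the intermediate activation `hidden` has dimension `o1 * r * i2`, which is how the bond index shows up as an ordinary hidden width of a standard linear layer.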