Tensor networks and efficient descriptions of classical data

Sirui Lu, Márton Kanász-Nagy, Ivan Kukuljan, J. Ignacio Cirac

TL;DR

This work investigates whether tensor-network-based machine-learning methods can scale to large image and text datasets by studying how mutual information scales with subsystem size, and introduces two models that reproduce the observed scaling: a quantum-inspired random pair toy model and a linguistically motivated Markovian dependency tree model.

Abstract

We investigate the potential of tensor-network-based machine learning methods to scale to large image and text data sets. To this end, we study how the mutual information between a subregion and its complement scales with the subsystem size $L$, as is done in quantum many-body physics. We find that for text, the mutual information scales as a power law $L^\nu$ with an exponent close to that of a volume law, indicating that text cannot be efficiently described by 1D tensor networks. For images, the scaling is close to an area law, hinting that 2D tensor networks such as PEPS could have adequate expressibility. For the numerical analysis, we introduce a mutual information estimator based on autoregressive networks, and we also use convolutional neural networks in a neural estimator method.
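
As a minimal sketch of what "mutual information between a subregion and its complement as a function of $L$" means, the following Python example computes $I(A:B) = S(A) + S(B) - S(AB)$ exactly for a small binary Markov chain. The chain, its flip probability `p_flip`, and all function names are illustrative assumptions and do not come from the paper. A Markov chain is the simplest classical case in which the mutual information across a cut stays constant in $L$, which is the 1D area-law behavior that matrix product states can capture.

```python
import itertools
import numpy as np

def chain_joint(n, p_flip=0.2):
    """Exact joint distribution of a stationary binary Markov chain of length n."""
    # Hypothetical toy model: each site copies its left neighbor,
    # flipping with probability p_flip.
    T = np.array([[1 - p_flip, p_flip], [p_flip, 1 - p_flip]])
    joint = np.zeros((2,) * n)
    for x in itertools.product((0, 1), repeat=n):
        prob = 0.5  # uniform (stationary) distribution on the first site
        for a, b in zip(x[:-1], x[1:]):
            prob *= T[a, b]
        joint[x] = prob
    return joint

def entropy_bits(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability array."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint, L):
    """I(A:B) for A = first L sites and B = the remaining sites."""
    n = joint.ndim
    p_a = joint.sum(axis=tuple(range(L, n)))  # marginalize out B
    p_b = joint.sum(axis=tuple(range(L)))     # marginalize out A
    return entropy_bits(p_a) + entropy_bits(p_b) - entropy_bits(joint)

L_max = 8
joint = chain_joint(L_max)
mi = [mutual_information(joint, L) for L in range(1, L_max)]
for L, value in enumerate(mi, start=1):
    print(f"L={L}: I(A:B) = {value:.4f} bits")
```

For this toy chain the printed values do not depend on $L$, because all correlations between the two regions pass through the single bond at the cut. Exact enumeration is only feasible for tiny systems, which is why estimators are needed for real image and text data.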

Paper Structure

This paper contains 24 sections, 39 equations, 11 figures.

Figures (11)

  • Figure 1: (a) Schematic illustration of one-dimensional partitioning into regions $A$ and $B$ of lengths $L$ and $L_{\max}-L$, where $L_{\max}$ is the total system size. (b) Three distinct partitioning schemes for two-dimensional image data: left:right ($\text{L}:\text{R}$), center:surroundings ($\text{C}:\text{S}$), and top:bottom ($\text{T}:\text{B}$). Example images are from MNIST (handwritten digit 5), Fashion-MNIST (T-shirt), and CIFAR-10 (horse head) datasets, with random displacements applied to ensure translational invariance. (c) Tensor network representations and their characteristic mutual information scaling: matrix product states (MPS) exhibit constant scaling $I(A:B) \propto c$, projected entangled pair states (PEPS) show linear scaling with boundary length $I(A:B) \propto cL_A$, and tree tensor network states display logarithmic scaling $I(A:B) \propto c\log L_A$. When area laws hold, the mutual information $I(A:B)$ is bounded by the number of sites near the $A$-$B$ boundary.
  • Figure 2: Autoregressive neural networks and two distinct orderings for computing marginal probabilities. Left: Autoregressive Network 1 processes a $5\times5$ image in raster scan order ($1$ to $25$), computing conditional probabilities $\mathbb{P}_A(\boldsymbol{x})=\prod_{i=1}^{10} p^{\text{AN1}}(x_i|x_{1:i-1})$ to estimate entropy $S(A)$ of the top region. Right: Autoregressive Network 2 uses reverse ordering ($25$ to $1$) to compute $\mathbb{P}_B(\boldsymbol{x})=\prod_{i=1}^{15} p^{\text{AN2}}(x_i|x_{1:i-1})$ for entropy $S(B)$ of the bottom region. These two networks with complementary orderings are trained separately, allowing estimation of marginal entropies for both top and bottom regions independently and thus mutual information via Eq. \ref{eq:entropy_MI_AN}.
  • Figure 3: Shannon entropy scaling in image datasets estimated using PixelCNN \cite{PixelRCNN} and PixelCNN++ \cite{Salimans2017PixelCNN} architectures. (a) Entropy curves for MNIST and Fashion-MNIST ($28\times28$ pixels) showing volume law scaling $S(L) \propto L$ for both top and bottom regions. The close agreement between top and bottom curves indicates consistent probability distributions captured by the trained models. (b) Corresponding analysis for CIFAR-10 ($32\times32\times3$ pixels) demonstrating similar volume law scaling. The $x$-axis shows normalized region length $L/L_{\max}$, while the $y$-axis displays entropy $S(L)$ in bits.
  • Figure 4: Mutual information scaling in image datasets analyzed using three complementary estimation methods: (i) PixelCNN/PixelCNN++ autoregressive networks computing exact conditional probabilities, (ii) mutual information neural estimation (MINE) using convolutional neural networks as variational functions, and (iii) $k$-nearest neighbor (kNN) density estimation. Left panels show top:bottom ($\text{T}:\text{B}$) partitioning; right panels show center:surroundings ($\text{C}:\text{S}$) partitioning for: (a1--a2) MNIST, (b1--b2) Fashion-MNIST ($28\times28$ pixels), and (c1--c2) CIFAR-10 ($32\times32\times3$ pixels). For MNIST, the simplest dataset, we observe evidence of area law scaling through $\text{T}:\text{B}$ saturation and linear $\text{C}:\text{S}$ growth. For more complex datasets (Fashion-MNIST and CIFAR-10), $I(\text{C}:\text{S})$ still grows linearly but $I(\text{T}:\text{B})$ does not reach a plateau, suggesting faster than area law scaling. The $x$-axis represents the normalized partition length $L/L_{\max}$, and the $y$-axis shows the mutual information $I(A:B)$ in bits, with kNN results globally adjusted for consistent comparison.
  • Figure 5: Mutual information analysis of the WikiText-2 dataset using 50-dimensional GloVe word embeddings. (a) MINE estimates for varying sequence lengths $L_{\max}=50,100,200$ words, showing consistent scaling behavior after normalization. A text convolutional neural network \cite{Kim2014Convolutional} is used as the score function. (b1) Comparison between MINE and kNN ($k=20$, $\text{MaxNText}=10000$) estimators for $L_{\max}=200$, with power-law fitting region highlighted in gray. (b2) Log-log analysis of the initial part of the $I(L)$ curve revealing power-law scaling $I(L)\propto L^{0.82(2)}$ (blue dashed line) for small $L$, compared with the theoretical upper bound $I(L)\propto L(L_{\max}-L)$ (red dashed line) derived for maximally correlated elements in Sec. \ref{Sec:ToyModel}. The $x$-axis represents the normalized length of the left region, $L/L_{\max}$, and the $y$-axis shows the mutual information $I(\text{L}:\text{R})$ in bits.
  • ...and 6 more figures
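
The log-log analysis mentioned in the Figure 5 caption extracts a power-law exponent $\nu$ from $I(L) \propto L^{\nu}$. A minimal sketch of such a fit is shown below, using a straight-line least-squares fit in log-log space. The data are synthetic stand-ins generated with an assumed exponent of 0.8 and an assumed prefactor of 0.5; they are not values from the paper, and the function name `fit_power_law` is hypothetical. The paper's own fitting procedure may differ.

```python
import numpy as np

def fit_power_law(L, I):
    """Fit log I = nu * log L + log c by least squares; return (nu, c)."""
    nu, log_c = np.polyfit(np.log(L), np.log(I), deg=1)
    return float(nu), float(np.exp(log_c))

# Synthetic stand-in data (assumption): I(L) = 0.5 * L^0.8 with small
# multiplicative noise. These are NOT measurements from the paper.
rng = np.random.default_rng(0)
L = np.arange(1, 21)
I = 0.5 * L**0.8 * np.exp(rng.normal(0.0, 0.02, size=L.size))

nu, c = fit_power_law(L, I)
print(f"fitted exponent nu = {nu:.3f}, prefactor c = {c:.3f}")
```

Restricting the fit to small $L$, as the caption's highlighted fitting region does, matters because $I(L)$ must eventually bend over and return to zero as $L$ approaches $L_{\max}$, so a single power law cannot hold across the full range.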