Efficient Probabilistic Tensor Networks

Marawan Gamal Abdel Hameed, Guillaume Rabusseau

TL;DR

This work proposes a conceptually simple, numerically stable approach for learning PTNs efficiently. On a variety of density estimation benchmarks, it enables learning distributions with 10x more variables than previous approaches.

Abstract

Tensor networks (TNs) enable compact representations of large tensors through shared parameters. Their use in probabilistic modeling is particularly appealing, as probabilistic tensor networks (PTNs) allow for tractable computation of marginals. However, existing approaches for learning the parameters of PTNs are either computationally demanding and not fully compatible with automatic differentiation frameworks, or numerically unstable. In this work, we propose a conceptually simple and numerically stable approach for learning PTNs efficiently. We show that our method provides significant improvements in time and space complexity, achieving a 10x reduction in latency for generative modeling on the MNIST dataset. Furthermore, our approach enables learning distributions with 10x more variables than previous approaches when applied to a variety of density estimation benchmarks. Our code is publicly available at github.com/marawangamal/ptn.
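To make the claim about tractable marginals concrete, here is a minimal sketch for a matrix product state (MPS) with non-negative cores. It is an illustration and not the authors' implementation: the function names, the toy cores, and the `core[d][r][c]` storage layout (physical index first) are assumptions made for readability. Summing a core over its physical index marginalizes that variable, so a marginal costs one pass over the chain instead of a sum over all $D^N$ configurations.

```python
from itertools import product


def matvec(v, M):
    """Row vector v (length R) times matrix M (R x R')."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]


def unnormalized_prob(cores, assignment):
    """Contract the chain left to right.

    assignment[i] is a value in range(D), or None to sum variable i out.
    cores[i][d] is the R_i x R_{i+1} matrix for physical value d
    (hypothetical layout chosen for this sketch).
    """
    v = [1.0]  # left boundary: bond dimension 1
    for core, y in zip(cores, assignment):
        if y is None:
            rows, cols = len(core[0]), len(core[0][0])
            # Marginalize: add the matrices of all physical values.
            M = [[sum(core[d][r][c] for d in range(len(core)))
                  for c in range(cols)] for r in range(rows)]
        else:
            M = core[y]
        v = matvec(v, M)
    return v[0]  # right boundary: bond dimension 1


# Toy model with N = 3 variables, D = 2 values, bond dimension R = 2.
cores = [
    [[[1.0, 2.0]], [[0.5, 1.0]]],                          # D x 1 x 2
    [[[1.0, 0.0], [0.5, 1.0]], [[2.0, 1.0], [0.0, 1.0]]],  # D x 2 x 2
    [[[1.0], [2.0]], [[3.0], [0.5]]],                      # D x 2 x 1
]

Z = unnormalized_prob(cores, [None, None, None])       # normalization constant
p_y1_is_0 = unnormalized_prob(cores, [0, None, None]) / Z  # marginal of y_1 = 0

# Brute-force check by enumerating every configuration.
brute_Z = sum(unnormalized_prob(cores, list(y)) for y in product(range(2), repeat=3))
```

The same routine returns the normalization constant when every variable is summed out, which is why the marginal and the normalizer share one code path in this sketch.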

Paper Structure

This paper contains 22 sections, 5 theorems, 26 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let the elements of the tensor $\boldsymbol{\mathcal{G}}^{(i)} \in \mathbb{R}^{R_i\times D\times R_{i+1}}$ be i.i.d. random variables drawn from a zero-mean Gaussian distribution with unit variance, with $R_1 = R_N = 1$ and $R_i = R \quad\forall i \neq 1,N$. Let $\mathbf{y}\in \mathcal{Y}$, … where $\dot{\boldsymbol{\mathcal{G}}}_{ij} \triangleq \sum_k \sigma(\ldots)$ …

Figures (7)

  • Figure 1: Comparison between training methods for PTNs. (a) DMRG (Han et al., 2018), (b) SGD (Glasser et al., 2019), (c) our method using SGD with logarithmic scale factors (LSF), and (d) latency, memory usage, and a measure of instability of the methods. DMRG has exponentially higher latency and memory usage compared with LSF and SGD. However, SGD is numerically unstable. The instability metric is equal to the remaining iterations out of $10^4$ when a numerical overflow is encountered. Even with a modest system size of 100 cores, numerical overflow occurs after just two iterations (see \ref{app:crown-jewel-hps} for experimental details).
  • Figure 2: Normalization constant of various PTNs. (a) MPS, (b) CP, and (c) Tensor Tree
  • Figure 3: Magnitude of numerator terms in Equations \ref{eq:mps-born} and \ref{eq:mps-sigma} as $N$ is increased for MPS-based models.
  • Figure 4: Illustration of a single update step of the DMRG two-site update algorithm used in Han et al. (2018). (1) Cores $\boldsymbol{\mathcal{G}}^{(1)}$ and $\boldsymbol{\mathcal{G}}^{(2)}$ are merged, (2) the loss is computed with respect to the merged fourth-order tensor, (3) the gradient is computed and used to update the fourth-order tensor using automatic differentiation, and (4) the fourth-order tensor is decomposed using SVD, then singular vectors are copied into cores $\boldsymbol{\mathcal{G}}^{(1)}$ and $\boldsymbol{\mathcal{G}}^{(2)}$.
  • Figure 5: Maximum number of iterations reached during training using vanilla stochastic gradient descent $\mathrm{MPS}_{\sigma + \mathrm{SGD}}$ vs. stochastic gradient descent with logarithmic scale factors $\mathrm{MPS}_{\sigma + \mathrm{LSF}}$ (ours).
  • ...and 2 more figures
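
The overflow described in the Figure 1 caption, and the general idea of logarithmic scale factors, can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the positive random matrices stand in for MPS cores with the physical index already fixed, and rescaling by the largest entry of the running vector is one possible choice of scale factor, made here only for the demonstration. All function and variable names are hypothetical.

```python
import math
import random


def matvec(v, M):
    """Row vector v (length R) times matrix M (R x R')."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]


def naive_contract(mats):
    """Multiply the chain directly; the running vector can overflow."""
    v = [1.0] * len(mats[0])
    for M in mats:
        v = matvec(v, M)
    return sum(v)


def log_contract(mats):
    """Return the log of the same quantity, rescaling after every step."""
    v = [1.0] * len(mats[0])
    log_scale = 0.0
    for M in mats:
        v = matvec(v, M)
        s = max(v)                # positive, since all entries are positive
        v = [x / s for x in v]    # keep the running vector at unit scale
        log_scale += math.log(s)  # remember the factor that was divided out
    return log_scale + math.log(sum(v))


def random_chain(n, r, rng):
    """n positive r x r matrices with entries drawn uniformly from [1, 2]."""
    return [[[rng.uniform(1.0, 2.0) for _ in range(r)] for _ in range(r)]
            for _ in range(n)]


rng = random.Random(0)
short_chain = random_chain(5, 8, rng)
long_chain = random_chain(2000, 8, rng)

naive_long = naive_contract(long_chain)  # overflows to float('inf')
log_long = log_contract(long_chain)      # stays finite
```

On the short chain both routines agree up to floating-point error, which gives a quick sanity check that the accumulated log factors reproduce the unscaled product.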

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • Theorem 2
  • proof
  • Theorem 2
  • proof