Subtractive Mixture Models via Squaring: Representation and Learning

Lorenzo Loconte, Aleksanteri M. Sladek, Stefan Mengel, Martin Trapp, Arno Solin, Nicolas Gillis, Antonio Vergari

TL;DR

This work introduces subtractive mixture models obtained by squaring a linear combination of base components within probabilistic circuits (PCs), which guarantees non-negativity while keeping normalization tractable. By embedding squared mixtures in deep, tensorized circuit architectures and enforcing structured decomposability, the authors derive an efficient squaring procedure and stable inference techniques, yielding squared non-monotonic PCs (NPC2s) that can be exponentially more expressive than traditional additive mixture models (MMs). They relate NPC2s to PSD models and Born machines via reductions, and show that squaring can substantially improve distribution estimation on both synthetic and real-world tasks, including GPT-2 distillation. The main theoretical result is an exponential expressiveness separation; the empirical results span density estimation benchmarks, positioning NPC2s as a versatile, scalable tool for tractable probabilistic modeling with negative parameters.

Abstract

Mixture models are traditionally represented and learned by adding several distributions as components. Allowing mixtures to subtract probability mass or density can drastically reduce the number of components needed to model complex distributions. However, learning such subtractive mixtures while ensuring they still encode a non-negative function is challenging. We investigate how to learn and perform inference on deep subtractive mixtures by squaring them. We do this in the framework of probabilistic circuits, which enable us to represent tensorized mixtures and generalize several other subtractive models. We theoretically prove that the class of squared circuits allowing subtractions can be exponentially more expressive than traditional additive mixtures; and, we empirically show this increased expressiveness on a series of real-world distribution estimation tasks.
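To make the squaring idea concrete, here is a minimal one-dimensional sketch, not the authors' implementation. It assumes Gaussian components and a shallow mixture $c(x) = \sum_i w_i \, \mathcal{N}(x; \mu_i, \sigma_i^2)$ with real (possibly negative) weights. The function names and parameter values are illustrative. The normalizer of $c(x)^2$ is computed in closed form from the standard identity $\int \mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B)\,dx = \mathcal{N}(a; b, A + B)$.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian, broadcasting over all arguments."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def squared_mixture_normalizer(weights, means, variances):
    """Z = sum_ij w_i w_j * integral of N_i(x) N_j(x) dx, in closed form."""
    # Pairwise integrals of products of Gaussians: N(mu_i; mu_j, var_i + var_j).
    gram = gaussian_pdf(means[:, None], means[None, :],
                        variances[:, None] + variances[None, :])
    return weights @ gram @ weights

def squared_mixture_density(x, weights, means, variances):
    """p(x) = c(x)^2 / Z, where c is a mixture with real-valued weights."""
    x = np.asarray(x, dtype=float)
    components = gaussian_pdf(x[..., None], means, variances)
    c = components @ weights  # may be negative for some x
    return c ** 2 / squared_mixture_normalizer(weights, means, variances)

# Illustrative parameters (assumed): a wide component minus a narrow one.
weights = np.array([1.0, -0.6])
means = np.array([0.0, 0.0])
variances = np.array([4.0, 0.25])

grid = np.linspace(-30.0, 30.0, 200001)
density = squared_mixture_density(grid, weights, means, variances)
```

With these values the unsquared mixture is negative near the origin, so the squared density touches zero on either side of it. That shape takes only two components here because subtraction is allowed.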

Paper Structure (33 sections, 17 theorems, 27 equations, 21 figures, 7 tables, 2 algorithms)


Key Result

Proposition 1

Let $c$ be a tensorized structured-decomposable circuit where the products of functions computed by each input layer can be tractably integrated. Any marginalization of $c^2$ obtained via Algorithm \ref{alg:tensorized-square} requires time and space $\mathcal{O}(L\cdot M^2)$.
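The following sketch illustrates why marginalizing a squared circuit can stay tractable. It is a toy under stated assumptions rather than the paper's algorithm: two discrete variables, input layers stored as hypothetical embedding tables `E1` and `E2`, one sum layer per variable, one Hadamard product layer, and a scalar output. It relies only on two standard identities: $(Wh) \otimes (Wh) = (W \otimes W)(h \otimes h)$ for sum layers and $(a \odot b) \otimes (a \odot b) = (a \otimes a) \odot (b \otimes b)$ for product layers.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 4  # K units per layer, V states per discrete variable (assumed sizes)

# Input layers as embedding tables: row v holds the K input-unit values for state v.
E1 = rng.normal(size=(V, K))
E2 = rng.normal(size=(V, K))
# Sum-layer parameters with unconstrained sign, plus the output sum unit.
W1 = rng.normal(size=(K, K))
W2 = rng.normal(size=(K, K))
w_out = rng.normal(size=K)

def circuit(x1, x2):
    """Evaluate c(x1, x2): sum layers, a Hadamard product, then a sum unit."""
    return w_out @ ((W1 @ E1[x1]) * (W2 @ E2[x2]))

def squared_input_sum(E):
    """Marginalize a squared input layer: sum over states v of e_v kron e_v."""
    return sum(np.kron(e, e) for e in E)

def squared_partition_function():
    """Sum of c(x1, x2)^2 over all states, in one pass over the squared layers."""
    left = np.kron(W1, W1) @ squared_input_sum(E1)    # squared sum layer, K^2 units
    right = np.kron(W2, W2) @ squared_input_sum(E2)   # squared sum layer, K^2 units
    return np.kron(w_out, w_out) @ (left * right)     # squared product and output

# Reference value obtained by enumerating every joint state.
brute_force = sum(circuit(a, b) ** 2
                  for a, b in itertools.product(range(V), repeat=2))
Z = squared_partition_function()
```

Each squared layer here has $K^2$ units instead of $K$, yet the pass never enumerates joint states, whose number grows exponentially with the number of variables. The quadratic blow-up per layer matches the flavor of cost that the proposition bounds.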

Figures (21)

  • Figure 1: Shallow MMs and squared NMMs represented as PCs, mapped to a computational graph having input components and a weighted sum unit as output. Squaring a mixture with $K=3$ components (left) can yield more components that share parameters (right).
  • Figure 2: Squaring tensorized structured-decomposable circuits reduces to squaring layers, depicted as colored boxes for input, product, and sum layers. Connections to a sum layer are labeled by the matrix parameterizing the layer, while connections to product layers are labeled by the Hadamard product sign (see also \ref{fig:scalar-circuit-tensorized}). A tensorized structured-decomposable circuit (b) over three variables, defined from the region graph (RG) in (a), is squared in (c) by recursively squaring each layer via \ref{alg:tensorized-square}. Squared layers contain a quadratic number of units, but still output vectors.
  • Figure 3: NPC2s are better estimators, especially with parameter-efficient input layers. Distributions estimated by monotonic PCs (MPC), squared monotonic PCs (MPC2) and NPC2s on 2D continuous (above) and discrete (below) data. On continuous data, input layers compute splines (\ref{eq:b-splines}), while on discrete data they compute either categoricals (for MPC and MPC2), embeddings (for NPC2s) or Binomials. \ref{app:experimental-synthetic-continuous,app:experimental-synthetic-discrete} also report log-likelihoods on additional data.
  • Figure 4: NPC2s can be more expressive than monotonic PCs (MPCs). Best average log-likelihoods achieved by monotonic PCs ($+$) and NPC2s ($\pm^2$), built either from randomized linear tree (LT) or binary tree (BT) RGs (see \ref{app:experimental-uci}). The scatter plots (left) pair log-likelihoods based on the number of units per layer $K$ (the higher the darker), differentiating PCs with Gaussian (G/blue) and spline (S/red) input layers. Both axes of each scatter plot are on the same scale, so points above the diagonal correspond to NPC2s achieving higher log-likelihoods than MPCs at equal model size. The table (right) shows our models' best average test log-likelihoods and puts them in context with models that are intractable (above) and tractable (below) w.r.t. variable marginalization.
  • Figure 5: NPC2s ($\pm^2$) achieve higher log-likelihoods than monotonic PCs ($+$) on data sampled from GPT2. We report the median and the area including 90% of runs by varying the size of layers $K$ and other hyperparameters (\ref{app:experimental-distillation}). For comparison, the log-likelihood of GPT2 on the same training data is about $-52$. The difference on the test data is significant for most values of $K$ (see p-values in \ref{tab:gpt2-distillation-statistical-tests}).
  • ...and 16 more figures
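
The parameter sharing mentioned in the Figure 1 caption can be made concrete with a short worked expansion. The notation is assumed here rather than taken from the source: $c(x) = \sum_{i=1}^{3} w_i f_i(x)$, with real weights $w_i$ and input components $f_i$.

$$
c(x)^2 = \Big(\sum_{i=1}^{3} w_i f_i(x)\Big)^2 = \sum_{i=1}^{3} w_i^2 f_i(x)^2 + 2 \sum_{1 \le i < j \le 3} w_i w_j \, f_i(x) f_j(x)
$$

The expansion has $K(K+1)/2 = 6$ distinct product components $f_i f_j$, yet all six coefficients are determined by the same three weights. A cross term with $w_i w_j < 0$ subtracts mass, while $c(x)^2 \ge 0$ holds by construction.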

Theorems & Definitions (38)

  • Definition 1: Tensorized circuit
  • Definition 2: Region graph (dennis2012learning)
  • Proposition 1: Tractable marginalization of squared circuits
  • Proposition 2: Reduction from PSD models
  • Proposition 3: Reduction from BMs
  • Proposition 4: Squaring deterministic circuits
  • Theorem 1: Expressive efficiency of NPC2s
  • Definition A.1: Circuit (choi2020pc; vergari2021compositional)
  • Definition A.2: Smoothness and decomposability (darwiche2002knowledge)
  • Proposition A.1: Tractability (choi2020pc)
  • ...and 28 more