Rethinking Probabilistic Circuit Parameter Learning

Anji Liu, Zilei Shao, Guy Van den Broeck

TL;DR

Rethinking Probabilistic Circuit Parameter Learning addresses the scalability gap in training probabilistic circuits by reframing EM as a KL-regularized linearization of the log-likelihood. It shows that existing mini-batch EM and gradient-based methods overfit the current batch due to insufficient regularization of distribution changes, and introduces anemone, a mini-batch EM with an implicit adaptive learning rate per parameter guided by its TD-prob. Anemone yields a closed-form update and preserves local normalization, enabling efficient and stable training. Across language, image, and DNA datasets with diverse PC architectures, anemone achieves faster convergence and higher final log-likelihood than full EM, mini-batch EM, and Adam, demonstrating strong practical scalability.

Abstract

Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. Although empirical extensions to the mini-batch setting, as well as gradient-based mini-batch algorithms, converge faster than full-batch EM, they generally underperform in terms of final likelihood. We investigate this gap by establishing a novel theoretical connection between these practical algorithms and the general EM objective. Our analysis reveals a fundamental issue: existing mini-batch EM and gradient-based methods fail to properly regularize distribution changes, causing each update to effectively "overfit" the current mini-batch. Motivated by this insight, we introduce anemone, a new mini-batch EM algorithm for PCs. Anemone applies an implicit adaptive learning rate to each parameter, scaled by how much it contributes to the likelihood of the current batch. Across extensive experiments on language, image, and DNA datasets, anemone consistently outperforms existing optimizers in both convergence speed and final performance.
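To make the contrast between full-batch and mini-batch EM concrete, the following is a minimal sketch on a toy PC: a single sum node over two fixed Gaussian leaves, where only the sum-node weights are learned. It is not the anemone update, whose closed form is not given in this summary. The mini-batch rule shown (a convex combination of the old weights and the batch EM estimate with a fixed `step_size`) is one common baseline form, assumed here for illustration. All names (`LEAVES`, `em_target`, `minibatch_em_step`, and so on) are hypothetical.

```python
import math
import random

# Hypothetical toy PC: one sum node over two fixed Gaussian leaves (mean, std).
LEAVES = [(-2.0, 1.0), (2.0, 1.0)]

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def em_target(weights, batch):
    """E-step then M-step on `batch`: normalized expected edge counts."""
    counts = [0.0] * len(weights)
    for x in batch:
        joint = [w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, LEAVES)]
        total = sum(joint)
        for i, j in enumerate(joint):
            counts[i] += j / total  # posterior responsibility of edge i
    n = sum(counts)
    return [c / n for c in counts]

def full_batch_em_step(weights, data):
    """One full-batch EM update: needs a pass over the entire dataset."""
    return em_target(weights, data)

def minibatch_em_step(weights, batch, step_size):
    """Assumed baseline rule: interpolate toward the batch EM estimate."""
    target = em_target(weights, batch)
    return [(1.0 - step_size) * w + step_size * t for w, t in zip(weights, target)]

def avg_log_likelihood(weights, data):
    total = 0.0
    for x in data:
        total += math.log(sum(w * gaussian_pdf(x, m, s)
                              for w, (m, s) in zip(weights, LEAVES)))
    return total / len(data)

# Small demo on synthetic data (true mixing weights 0.8 / 0.2, an assumption).
rng = random.Random(0)
demo_data = [rng.gauss(-2.0, 1.0) if rng.random() < 0.8 else rng.gauss(2.0, 1.0)
             for _ in range(1000)]
w = [0.5, 0.5]
for start in range(0, len(demo_data), 50):  # one epoch of mini-batches of size 50
    w = minibatch_em_step(w, demo_data[start:start + 50], step_size=0.1)
print("weights after one mini-batch epoch:", w)
print("average log-likelihood:", avg_log_likelihood(w, demo_data))
```

With `step_size=1.0` the update discards the previous weights and keeps only the current batch's estimate, which is the kind of per-batch "overfitting" the abstract describes; a smaller fixed step damps this uniformly across parameters, whereas anemone is described as adapting the effective rate per parameter.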

Paper Structure

This paper contains 36 sections, 3 theorems, 76 equations, 3 figures, 5 tables.

Key Result

Proposition 1

Given a PC ${p}_{\boldsymbol{\phi}}$ with log-parameters $\boldsymbol{\phi}$ (cf. Definition 1) and a dataset $\mathcal{D}$, $Q_{\boldsymbol{\phi}}^{\mathcal{D}} (\boldsymbol{\phi}')$ equals the following up to a constant term irrelevant to $\boldsymbol{\phi}'$: where $\mathtt{KL}_{\boldsymbol{\phi}} (\boldsymbol{\phi}') := \mathrm{D}_{\mathrm{KL}} \left ( {p}_{\boldsymbol{\phi}} (\mathbf{X}, \m

Figures (3)

  • Figure 1: The proposed algorithm implicitly applies an adaptive learning rate to each node. For the PC shown on the left, given a sample $x \!=\! -1.5$, the algorithm uses a large learning rate to update $n_1$ while keeping $n_2$ almost unchanged.
  • Figure 2: Log-likelihood over epochs on four diverse datasets. For the ImageNet (YCC-R) and ImageNet (YCC) datasets, an HCLT with hidden size 512 is used; for the WikiText dataset, an HMM with hidden size 256 is used; for the BioBank dataset, a PDHCLT with hidden size 1024 is used. Anemone achieves significantly faster convergence and a higher final LL in all four cases.
  • Figure 3: Ablation study on the effect of momentum when combined with anemone and vanilla mini-batch EM, respectively. Left: For the anemone optimizer (HMM with hidden size 256 on WikiText), incorporating momentum improves the final log-likelihood despite slightly slower initial convergence, while still being significantly faster than full-batch EM. Right: In contrast, for mini-batch EM (PDHCLT with hidden size 512 on BioBank Chr6), momentum provides little benefit.

Theorems & Definitions (9)

  • Definition 1: Probabilistic Circuit
  • Proposition 1
  • Definition 2: TD-prob
  • Lemma 1
  • Definition 3: Smoothness and Decomposability
  • Proof of Proposition 1
  • Proof of Lemma 1
  • Lemma 2
  • proof