Scaling Tractable Probabilistic Circuits: A Systems Perspective

Anji Liu; Kareem Ahmed; Guy Van den Broeck

TL;DR

This work introduces PyJuice, a GPU-accelerated system for training and inference with Probabilistic Circuits (PCs). It trains large-scale PCs 1-2 orders of magnitude faster than existing systems while using 2-5x less GPU memory. The key idea is a compilation step that converts a PC into a compact, block-aware representation, which enables block-parallel execution, reduces IO, and makes it possible to use Tensor Cores; backpropagation is further optimized with PC flows. The speedups hold for both dense and sparse PCs across image and language tasks. PyJuice also allows larger models to be trained within a fixed memory budget and establishes a new set of baselines for future PC research. Code is available at https://github.com/Tractables/pyjuice.

Abstract

Probabilistic Circuits (PCs) are a general framework for tractable deep generative models, which support exact and efficient probabilistic inference on their learned distributions. Recent modeling and training advancements have enabled their application to complex real-world tasks. However, the time and memory inefficiency of existing PC implementations hinders further scaling. This paper proposes PyJuice, a general GPU implementation design for PCs that improves on prior art in several regards. Specifically, PyJuice is 1-2 orders of magnitude faster than existing systems (including very recent ones) at training large-scale PCs. Moreover, PyJuice consumes 2-5x less GPU memory, which enables us to train larger models. At the core of our system is a compilation process that converts a PC into a compact representation amenable to efficient block-based parallelization, which significantly reduces IO and makes it possible to leverage the Tensor Cores available in modern GPUs. Empirically, PyJuice can be used to improve state-of-the-art PCs trained on image (e.g., ImageNet32) and language (e.g., WikiText, CommonGen) datasets. We further establish a new set of baselines on natural image and language datasets by benchmarking existing PC structures at much larger sizes and with more training epochs, in the hope of incentivizing future research. Code is available at https://github.com/Tractables/pyjuice.
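To make the idea of block-based parallelization concrete, the following is a minimal CPU-only sketch, not PyJuice's actual kernel. It assumes a dense sum layer whose outputs are computed in log space as log of the weighted sum of exponentiated inputs; the function names (`sum_layer_naive`, `sum_layer_blocked`, `_logaddexp`) and the toy weights are hypothetical and chosen for illustration only. The blocked variant loads each block of inputs once and reuses it for every sum node in the current block, which is the source of the IO savings the abstract refers to.

```python
import math

def _logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def sum_layer_naive(log_inputs, weights):
    """Reference: one log-sum-exp per sum node over all of its inputs."""
    out = []
    for row in weights:
        terms = [math.log(w) + x for w, x in zip(row, log_inputs) if w > 0.0]
        m = max(terms)
        out.append(m + math.log(sum(math.exp(t - m) for t in terms)))
    return out

def sum_layer_blocked(log_inputs, weights, block=2):
    """Blocked variant: each (sum-node block, input block) pair is one unit of work."""
    n_out, n_in = len(weights), len(log_inputs)
    out = [-math.inf] * n_out
    for i0 in range(0, n_out, block):           # block of sum nodes
        rows = range(i0, min(i0 + block, n_out))
        for j0 in range(0, n_in, block):        # block of input (product) nodes
            cols = range(j0, min(j0 + block, n_in))
            # Read this block of inputs once; every sum node in `rows` reuses it.
            chunk = [log_inputs[j] for j in cols]
            for i in rows:
                for j, x in zip(cols, chunk):
                    w = weights[i][j]
                    if w > 0.0:                 # zero weight means no edge
                        out[i] = _logaddexp(out[i], math.log(w) + x)
    return out

# Toy layer: 3 sum nodes over 4 inputs (values are illustrative only).
log_inputs = [math.log(p) for p in (0.1, 0.2, 0.3, 0.4)]
weights = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.00, 0.50, 0.50, 0.00],
]
naive = sum_layer_naive(log_inputs, weights)
blocked = sum_layer_blocked(log_inputs, weights, block=2)
```

Both functions return the same log-probabilities up to floating-point error. On a GPU, the inner accumulation over one block pair is the part that would map to a small matrix multiplication, which is what makes Tensor Cores applicable.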

Paper Structure (30 sections, 14 equations, 12 figures, 7 tables, 4 algorithms)

Introduction
Preliminaries and Related Work
Key Bottlenecks in PC Parallelization
Harnessing Block-Based PC Parallelization
Fully Connected Sum Layers
Generalizing To Practical Sum Layers
Efficient Implementations by Compiling PC Layers
Analysis: IO and Computation Overhead
Optimizing Backpropagation with PC Flows
Experiments
Faster Models with PyJuice
Better PCs At Scale
Benchmarking Existing PCs
Conclusion
Algorithm Details
...and 15 more sections

Figures (12)

Figure 1: Layering a PC by grouping nodes with the same topological depth (as indicated by the colors) into disjoint subsets. Both the forward and the backward computation can be carried out independently on nodes within the same layer.
Figure 2: Runtime breakdown of the feedforward pass of a PC with $\sim\!\!150$M edges. Both the IO and the computation overhead of the sum layers are significantly larger than the total runtime of product layers. Detailed configurations of the PC are shown in the table.
Figure 3: Illustration of block-based parallelization. A processor computes the output of $2$ sum nodes, by iterating through blocks of $2$ input product nodes and accumulating partial results.
Figure 4: A sum layer (left) with a block-sparse parameter matrix (middle) is compiled into two kernels (right) each with a balanced workload. During execution, each kernel uses the compiled sum/prod/param indices to compute the outputs of $m_0, \dots, m_5$.
Figure 5: Runtime and IO overhead of a sum layer from the PD structure (with $29$K nodes and $30$M edges). The results demonstrate significant performance gains from our block-based parallelization, even with small block sizes.
...and 7 more figures
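The layering described in the Figure 1 caption can be sketched as follows. This is an illustrative toy, not the paper's algorithm: it assumes the PC is given as a dictionary mapping each node to its list of children, and it takes "topological depth" to mean the length of the longest path from a node down to an input node. The function name `layer_by_depth` and the node names are hypothetical.

```python
from collections import defaultdict

def layer_by_depth(children):
    """Group the nodes of a DAG into layers by longest distance from the inputs.

    `children` maps each node to the list of its child nodes; input nodes
    have an empty list. Every child ends up in a strictly earlier layer than
    its parent, so the nodes within one layer do not depend on each other.
    """
    depth = {}

    def visit(node):
        # Memoized recursion; fine for a toy circuit, a large PC would
        # need an iterative traversal to avoid the recursion limit.
        if node not in depth:
            depth[node] = 1 + max((visit(c) for c in children[node]), default=-1)
        return depth[node]

    for node in children:
        visit(node)

    layers = defaultdict(list)
    for node, d in depth.items():
        layers[d].append(node)
    return [sorted(layers[d]) for d in sorted(layers)]

# Toy circuit: four input nodes, two product nodes, one sum node at the root.
toy_pc = {
    "x1a": [], "x1b": [], "x2a": [], "x2b": [],
    "p1": ["x1a", "x2a"],
    "p2": ["x1b", "x2b"],
    "s": ["p1", "p2"],
}
layers = layer_by_depth(toy_pc)
```

A forward pass would evaluate the layers in order and a backward pass in reverse order; within each layer, all nodes can be processed in parallel.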

Theorems & Definitions (2)

Definition 1: Probabilistic Circuit
Definition 2: PC flows
