On the Efficient Marginalization of Probabilistic Sequence Models

Alex Boyd

TL;DR

This dissertation focuses on using autoregressive models to answer complex probabilistic queries that go beyond single-step prediction, such as the timing of future events or the likelihood of a specific event occurring before another.

Abstract

Real-world data often exhibits sequential dependence across diverse domains such as human behavior, medicine, finance, and climate modeling. Probabilistic methods capture the inherent uncertainty associated with prediction in these contexts, with autoregressive models being especially prominent. This dissertation focuses on using autoregressive models to answer complex probabilistic queries that go beyond single-step prediction, such as the timing of future events or the likelihood of a specific event occurring before another. In particular, we develop a broad class of novel, efficient, and model-agnostic approximation techniques for marginalization in sequential models. These techniques rely solely on access to, and sampling from, the next-step conditional distributions of a pre-trained autoregressive model, which may be either a traditional parametric model or a more recent neural autoregressive model. Specific approaches are presented for discrete sequential models, marked temporal point processes, and stochastic jump processes, each tailored to a well-defined class of informative, long-range probabilistic queries.
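To make the phrase "access to and sampling from next-step conditional distributions" concrete, the following is a minimal sketch of the naive baseline: estimating the probability that one event occurs before another by rolling out samples from a model. The three-symbol vocabulary, the toy `next_step_probs` function (a stand-in for a pre-trained autoregressive model), and the function name `estimate_a_before_b` are all assumptions made for illustration and do not come from the dissertation.

```python
import random

VOCAB = ["a", "b", "c"]

# Hypothetical toy model: the next-step distribution depends only on the
# last token. A real autoregressive model would condition on the full history.
_TABLE = {
    None: [0.2, 0.3, 0.5],
    "a": [0.1, 0.6, 0.3],
    "b": [0.5, 0.2, 0.3],
    "c": [0.2, 0.2, 0.6],
}


def next_step_probs(history):
    """Return p(x_k | history) as a list aligned with VOCAB."""
    last = history[-1] if history else None
    return _TABLE[last]


def estimate_a_before_b(num_samples, horizon, rng):
    """Naive Monte Carlo estimate of P('a' occurs before 'b' within `horizon` steps)."""
    hits = 0
    for _ in range(num_samples):
        history = []
        for _ in range(horizon):
            tok = rng.choices(VOCAB, weights=next_step_probs(history), k=1)[0]
            history.append(tok)
            if tok == "a":  # query satisfied: 'a' appeared first
                hits += 1
                break
            if tok == "b":  # query violated: 'b' appeared first
                break
    return hits / num_samples


rng = random.Random(0)
print(estimate_a_before_b(20_000, 10, rng))
```

The estimator only calls the model for next-step probabilities and samples from them, which is why this style of query estimation does not depend on the model's internals.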

Paper Structure (205 sections, 6 theorems, 170 equations, 37 figures, 11 tables, 1 algorithm)

Introduction
Motivating Use Cases
Contextual Queries
Forecasting
Missing Data
Contributions and Dissertation Outline
Background
Notation
Relevant Probability Theory
Probability and Random Variables
Conditioning and Organization of Information
Limiting Theorems
Expectation Approximation Techniques
Sequential Models
Autoregressive Modeling of Categorical Sequences
...and 190 more sections

Key Result

Theorem 2.1

casella2021statistical Let $X, X_1, X_2, \dots \overset{iid}{\sim} F_X$ where $\mu=\mathbb{E}^\mathbb{P}\left[X\right]$ exists and is finite, and denote the sample mean as $\overline{X}_n := \frac{1}{n}\sum_{i=1}^n X_i$. By the strong law of large numbers it follows that $\mathbb{P}\left(\lim_{n\rig
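The strong law of large numbers is what justifies approximating an expectation by a sample mean: with probability one, the sample mean of i.i.d. draws converges to $\mu$ as $n$ grows. The sketch below illustrates this numerically. The exponential distribution with mean 2 and the helper name `running_sample_means` are illustrative assumptions, not taken from the source.

```python
import random


def running_sample_means(draw, n):
    """Return the sample means after 1, 2, ..., n i.i.d. draws from `draw`."""
    total, means = 0.0, []
    for i in range(1, n + 1):
        total += draw()
        means.append(total / i)
    return means


rng = random.Random(0)
MU = 2.0  # true mean of the illustrative exponential distribution

# random.expovariate(lambd) has mean 1 / lambd.
means = running_sample_means(lambda: rng.expovariate(1.0 / MU), 100_000)

# The absolute error of the sample mean should shrink as n grows.
for n in (10, 1_000, 100_000):
    print(n, abs(means[n - 1] - MU))
```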

Figures (37)

Figure 1: (top) Illustration of a query for the probability of a given sentence "In my opinion..." ending in $K$ steps. (bottom) GPT-2 [gpt2-radford] hitting time estimates for sentence ending across 4 prefixes with $V=50{,}257$ and $K\leq 30$. Importance sampling query estimates maintain a 6x reduction in variance relative to naive model sampling for the same computation budget. Open-ended prefixes (top-left) generally possess longer-tailed distributions relative to simple prefixes. Almost no probability mass is found for $K=1$ because it is extremely likely that at least one more token follows the prompt before the sentence ends, in order to ensure proper grammar. Additional details are provided in \ref{sec:3_methods}, \ref{sec:3_experiments}, and \ref{sec:3_gpt_exp}.
Figure 2: (left) Tree diagram of the complete sequence space for a vocabulary $\mathcal{X}=\{a,b,c\}$ and the corresponding query space $\mathcal{Q}$ (right) for when the first appearance of $a$ occurs on the third step (i.e., $\text{hit}(a)=3$), defined as the set product of restricted domains listed below the figure.
Figure 3: Median relative absolute error (RAE) between estimated probability and (surrogate) ground truth for $p_\theta(\text{hit}(\cdot)=K)$ for importance sampling, beam search, and the hybrid method. As query path space grows with $K$, beam search quickly fails to bound ground truth while sampling remains robust, with the hybrid consistently outperforming all other methods, especially for large values of $K$. Ground truth values used to determine error are exact for $K \leq 4$ and approximated otherwise.
Figure 4: (a) RAE versus restricted entropy per query (with best linear fits). (b) Median RAE versus model temperature $T$ for the Mobile Apps data. All errors are computed using the same queries as in \ref{fig:3_err_plot}. Beam search errors correlate highly with model entropy, even on the low-entropy Mobile Apps dataset, where increasing the temperature $T$ directly induces this failure mode.
Figure 5: Median relative efficiency (over 1000 query histories and all vocabulary terms) of importance sampling estimation of the $K$-step marginal distributions for each dataset. The gray, dotted line represents 100% relative efficiency defined by naive query estimation. Relative efficiency is documented for $4 \leq K \leq 15$ to highlight the regime where ground truth cannot be tractably computed.
...and 32 more figures
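The captions for Figures 1 through 3 refer to hitting-time queries $p_\theta(\text{hit}(a)=K)$, a query space formed as a product of restricted domains, and importance sampling estimates. The sketch below works through a small instance with the vocabulary $\{a,b,c\}$ from Figure 2. It is an illustration under stated assumptions, not the dissertation's implementation: the toy model table and all function names are hypothetical, and the proposal used here (the model's own next-step distribution renormalized over each restricted domain) is one natural choice that makes the importance weight equal to the product of the restricted probability masses.

```python
import itertools
import random

VOCAB = ["a", "b", "c"]

# Hypothetical toy model: next-step probabilities depend only on the last token.
_TABLE = {
    None: [0.2, 0.3, 0.5],
    "a": [0.1, 0.6, 0.3],
    "b": [0.5, 0.2, 0.3],
    "c": [0.2, 0.2, 0.6],
}


def next_step_probs(history):
    """Return p(x_k | history) as a dict over VOCAB."""
    last = history[-1] if history else None
    return dict(zip(VOCAB, _TABLE[last]))


def query_space(target, k):
    """All sequences whose first `target` occurs at step k: (V minus target)^(k-1) x {target}."""
    others = [v for v in VOCAB if v != target]
    return itertools.product(*([others] * (k - 1) + [[target]]))


def exact_hit_prob(target, k):
    """Exact p(hit(target) = k) by enumerating the query space (tractable only for small k)."""
    total = 0.0
    for path in query_space(target, k):
        prob, history = 1.0, []
        for tok in path:
            prob *= next_step_probs(history)[tok]
            history.append(tok)
        total += prob
    return total


def importance_sample_hit_prob(target, k, num_samples, rng):
    """Importance sampling estimate using the model restricted to the query space as proposal."""
    allowed = [v for v in VOCAB if v != target]
    total = 0.0
    for _ in range(num_samples):
        weight, history = 1.0, []
        for _ in range(k - 1):
            probs = next_step_probs(history)
            # Model mass on the restricted domain is this step's weight factor.
            weight *= sum(probs[v] for v in allowed)
            tok = rng.choices(allowed, weights=[probs[v] for v in allowed], k=1)[0]
            history.append(tok)
        # The final step is forced to be `target`, contributing its probability.
        weight *= next_step_probs(history)[target]
        total += weight
    return total / num_samples


rng = random.Random(0)
print(list(query_space("a", 3)))
print(exact_hit_prob("a", 3), importance_sample_hit_prob("a", 3, 20_000, rng))
```

Because every proposal sample lies inside the query space, no samples are wasted on sequences that violate the query, which is the intuition behind the variance reduction relative to naive sampling described in the Figure 1 caption.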

Theorems & Definitions (9)

Theorem 2.1
Theorem 2.2
Theorem 2.3
Theorem 3.1
proof
Theorem 4.1
proof
Lemma 4.1
proof
