Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

Jiaming Song, Linqi Zhou

TL;DR

The paper addresses stagnation in generative pre-training algorithm design by proposing an inference-time perspective centered on two axes: sequence length for autoregressive models and refinement steps for diffusion models. It advocates designing inference algorithms before training to ensure capacity and efficiency at inference, and demonstrates this approach with Inductive Moment Matching (IMM) in the continuous domain. A key result is that making the target timestep $s$ an input to the velocity network $v_\theta$ resolves the DDIM limitation, yielding a stable single-stage sampler with superior sample quality and over $10\times$ inference efficiency without relying on score-based methods. The discussion also highlights challenges in discrete multi-token prediction due to naive Bayes assumptions and calls for well-specified joint-token modeling to enable true parallel sampling, arguing that embracing inference-time scaling can bridge autoregressive and diffusion paradigms and advance multimodal pre-training.

Abstract

Recent years have seen significant advancements in foundation models through generative pre-training, yet algorithmic innovation in this space has largely stagnated around autoregressive models for discrete signals and diffusion models for continuous signals. This stagnation creates a bottleneck that prevents us from fully unlocking the potential of rich multi-modal data, which in turn limits the progress on multimodal intelligence. We argue that an inference-first perspective, which prioritizes scaling efficiency during inference time across sequence length and refinement steps, can inspire novel generative pre-training algorithms. Using Inductive Moment Matching (IMM) as a concrete example, we demonstrate how addressing limitations in diffusion models' inference process through targeted modifications yields a stable, single-stage algorithm that achieves superior sample quality with over an order of magnitude greater inference efficiency.

Paper Structure

This paper contains 6 sections, 2 equations, 1 figure.

Figures (1)

  • Figure 1: Left: traditional one-step DDIM sampling under the Flow Matching construction. Given ${\mathbf{x}}_t$ and $t$, ${\mathbf{x}}_s$ is produced by following the predicted velocity field $v_\theta({\mathbf{x}}_t,t)$. However, the model lacks the capacity to land on the ODE flow solution at $s$ in one step, because $v_\theta({\mathbf{x}}_t, t)$ is unaware of $s$ and thus cannot approximate any function of $s$. Right: a practical fix simply injects $s$ into the network, giving the model enough capacity to approximate a direct jump to the correct solution.
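
The capacity argument in Figure 1 can be checked numerically on a toy problem. The sketch below is not the paper's implementation. It assumes a hypothetical 1-D Gaussian data distribution $\mathcal{N}(M, D^2)$ and the flow-matching interpolant ${\mathbf{x}}_t = (1-t)\,{\mathbf{x}} + t\,\epsilon$, for which the marginal velocity and the exact ODE flow map have closed forms. The function `s_aware_velocity` is an oracle standing in for a trained network $v_\theta({\mathbf{x}}_t, t, s)$; all names and constants here are illustrative assumptions.

```python
import math

# Hypothetical toy data distribution N(M, D^2); not taken from the paper.
M, D = 2.0, 0.5


def mean(t):
    """Mean of x_t = (1 - t) * x + t * eps, with eps ~ N(0, 1)."""
    return (1.0 - t) * M


def std(t):
    """Standard deviation of x_t under the same interpolant."""
    return math.sqrt(((1.0 - t) * D) ** 2 + t ** 2)


def velocity(x_t, t):
    """Marginal velocity E[eps - x | x_t]; it has no knowledge of s."""
    dmean = -M
    dstd = (t - (1.0 - t) * D ** 2) / std(t)
    return dmean + dstd / std(t) * (x_t - mean(t))


def exact_jump(x_t, t, s):
    """Exact ODE flow from time t to time s (closed form for a Gaussian)."""
    return mean(s) + std(s) / std(t) * (x_t - mean(t))


def ddim_step(x_t, t, s):
    """One-step DDIM under flow matching: follow v(x_t, t) in a straight line."""
    return x_t + (s - t) * velocity(x_t, t)


def s_aware_velocity(x_t, t, s):
    """Oracle for an s-conditioned network: the average velocity from t to s."""
    return (exact_jump(x_t, t, s) - x_t) / (s - t)


def s_aware_step(x_t, t, s):
    """Same update rule as DDIM, but the velocity now depends on s."""
    return x_t + (s - t) * s_aware_velocity(x_t, t, s)


if __name__ == "__main__":
    x_t, t, s = 1.0, 1.0, 0.0  # jump from pure noise to data in one step
    print("exact  :", exact_jump(x_t, t, s))
    print("DDIM   :", ddim_step(x_t, t, s))
    print("s-aware:", s_aware_step(x_t, t, s))
```

In this toy setting the one-step DDIM update misses the exact flow result, while the same update rule with an $s$-dependent velocity lands on it. Only the inputs to the velocity function change, which is the modification the figure describes.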