Product of Experts for Visual Generation

Yunzhi Zhang, Carson Murtuza-Lanier, Zizhang Li, Yilun Du, Jiajun Wu

TL;DR

The paper introduces a training-free Product of Experts (PoE) framework that combines generative priors, discriminative rewards from visual language models (VLMs), and physics-based constraints at inference time by sampling from a product distribution via Annealed Importance Sampling (AIS) and Sequential Monte Carlo (SMC). It handles heterogeneous experts (flow-based and autoregressive generative models, and VLM-based rewards) and supports conditional and region-specific sampling to improve controllability in image and video synthesis. Key contributions include a general probabilistic formulation, per-timestep MCMC-based sampling that avoids path-wise weight degeneration, and practical instantiations for graphics-engine editing, physical-simulator–driven video generation, and layout-controlled text-to-image generation. Empirical results show improved adherence to constraints, background/foreground fidelity, and semantic alignment compared with baselines, and the framework offers flexible user interfaces for specifying complex visual goals. The work advances practical controllable generation by letting diverse knowledge sources cooperate at inference time without retraining, at the cost of extra compute for the intermediate sampling steps.

Abstract

Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
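
To make the sampling idea concrete, here is a minimal toy sketch of drawing from a product of two experts with annealed importance weights, SMC-style resampling, and Metropolis-Hastings moves. It is not the paper's implementation: the two 1D Gaussian experts, the geometric annealing path from a broad initial Gaussian, and all names and parameters (sample_product, num_levels, mcmc_steps, step) are assumptions chosen for illustration.

```python
import math
import random


def log_gauss(x, mean, std):
    """Log density of a 1D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def log_product(x):
    """Unnormalized log density of the product of two toy experts (assumed)."""
    return log_gauss(x, -1.0, 1.0) + log_gauss(x, 2.0, 0.5)


def log_init(x):
    """Broad, easy-to-sample initial distribution (assumed)."""
    return log_gauss(x, 0.0, 3.0)


def log_target(x, beta):
    """Geometric path: beta=0 is the initial distribution, beta=1 the product."""
    return (1.0 - beta) * log_init(x) + beta * log_product(x)


def normalize(log_w):
    """Turn log-weights into normalized weights, stably."""
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def sample_product(num_particles=400, num_levels=50, mcmc_steps=5, step=0.5, seed=0):
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 3.0) for _ in range(num_particles)]
    log_w = [0.0] * num_particles
    for t in range(1, num_levels + 1):
        beta_prev, beta = (t - 1) / num_levels, t / num_levels
        # AIS weight update: ratio of consecutive annealed targets at the current state.
        log_w = [lw + (beta - beta_prev) * (log_product(x) - log_init(x))
                 for lw, x in zip(log_w, xs)]
        # SMC resampling when the effective sample size drops below half.
        w = normalize(log_w)
        ess = 1.0 / sum(wi * wi for wi in w)
        if ess < num_particles / 2:
            xs = rng.choices(xs, weights=w, k=num_particles)
            log_w = [0.0] * num_particles
        # Random-walk Metropolis-Hastings moves that leave the current target invariant.
        for i in range(num_particles):
            x = xs[i]
            for _ in range(mcmc_steps):
                prop = x + rng.gauss(0.0, step)
                diff = log_target(prop, beta) - log_target(x, beta)
                if rng.random() < math.exp(min(0.0, diff)):
                    x = prop
            xs[i] = x
    return xs, normalize(log_w)


if __name__ == "__main__":
    xs, w = sample_product()
    mean = sum(wi * x for wi, x in zip(w, xs))
    var = sum(wi * (x - mean) ** 2 for wi, x in zip(w, xs))
    print(f"weighted mean={mean:.3f}, weighted variance={var:.3f}")
```

For this toy case the product of N(-1, 1) and N(2, 0.5^2) is itself Gaussian with mean 1.4 and variance 0.2, so the weighted particle statistics can be checked against a closed form. In the paper's setting the experts are generative models, VLM rewards, and simulators rather than analytic densities, so no such closed form is available.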

Paper Structure

This paper contains 49 sections, 10 equations, 10 figures, 4 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Application on Image Object Insertion where the goal is to insert assets posed in a graphics engine (top row) and described with text prompts (bottom row) into images (first column).
  • Figure 2: Image Object Insertion Comparisons. Our method better adheres to input geometric conditions while faithfully preserving background details. The last column shows that conditional sampling improves visual harmonization and fidelity.
  • Figure 3: Application on Physical-Simulation-Instructed Video Generation. Given an input image and a physical simulator rendering describing precise object motions, our method generates videos aligned with input motions while synthesizing natural content for non-foreground regions.
  • Figure 4: Comparisons on Physics-Simulator-Instructed Video Generation. Predictions are processed in grayscale and overlaid with tracks estimated by SpatialTracker (Xiao et al., 2024) for visualization.
  • Figure 5: Physical-Simulation-Instructed Video Generation compared with the baseline with the same backbone. Our method better adheres to the input object motion trajectory.
  • ...and 5 more figures