ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models

Rui Xu, Jiepeng Wang, Hao Pan, Yang Liu, Xin Tong, Shiqing Xin, Changhe Tu, Taku Komura, Wenping Wang

TL;DR

ComboStoc tackles the under-explored issue of combinatorial complexity in diffusion models by introducing asynchronous diffusion schedules that fully sample the space spanned by dimensions and attributes. This simple modification broadens network coverage, accelerates training, and enables new test-time capabilities such as partial preservation and graded conditioning across patches, parts, and features. Empirical results in images (ImageNet) and structured 3D shapes (PartNet) show systematic improvements in FID/FPD/MMD/COV and enable diverse generation tasks, including shape completion and part assembly. The approach provides a practical, broadly applicable principle for leveraging combinatorial structure in diffusion models, with significant implications for controllable generation across modalities.

Abstract

In this paper, we study an under-explored but important factor of diffusion generative models: combinatorial complexity. Data samples are generally high-dimensional, and various structured generation tasks associate additional attributes with the data samples. We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by existing training schemes for diffusion generative models, which degrades test-time performance. We present a simple fix to this problem by constructing stochastic processes that fully exploit the combinatorial structures, hence the name ComboStoc. Using this simple strategy, we show that network training is significantly accelerated across diverse data modalities, including images and 3D structured shapes. Moreover, ComboStoc enables a new way of test-time generation that uses asynchronous time steps for different dimensions and attributes, thus allowing for varying degrees of control over them.
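The difference between a shared time step and asynchronous per-dimension time steps can be illustrated with a minimal NumPy sketch. It assumes a linear interpolant of the form (1 - t) * z + t * x1 between a Gaussian source sample z and a data sample x1; the function names `interpolate` and `sample_timesteps` are hypothetical and chosen only for illustration, not taken from the authors' code.

```python
import numpy as np

def interpolate(x1, z, t):
    """Assumed linear interpolant: (1 - t) * z + t * x1, applied elementwise."""
    return (1.0 - t) * z + t * x1

def sample_timesteps(shape, rng, asynchronous):
    """Draw time steps in [0, 1] for a batch of the given shape.

    asynchronous=False: one scalar t per sample, shared by all dimensions.
    asynchronous=True:  an independent t for every dimension of every sample.
    """
    if asynchronous:
        return rng.uniform(0.0, 1.0, size=shape)
    per_sample = rng.uniform(0.0, 1.0, size=(shape[0],) + (1,) * (len(shape) - 1))
    return np.broadcast_to(per_sample, shape).copy()

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 2))  # stand-in for data samples (illustrative only)
z = rng.normal(size=(4, 2))   # Gaussian source samples

t_sync = sample_timesteps(x1.shape, rng, asynchronous=False)
t_async = sample_timesteps(x1.shape, rng, asynchronous=True)

# Synchronized points lie on the segment from z to x1; asynchronous points
# can fall anywhere in the axis-aligned box that has this segment as its diagonal.
x_sync = interpolate(x1, z, t_sync)
x_async = interpolate(x1, z, t_async)
```

In this sketch the synchronized case only ever visits the straight path between each source and target pair, while the asynchronous case spreads training points over the surrounding box, which is the broader coverage the abstract refers to.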

Paper Structure (22 sections, 4 equations, 30 figures, 5 tables)

Figures (30)

  • Figure 1: ComboStoc improves diffusion generative models across data modalities of images and structured 3D shapes. Left: structured 3D shapes where semantic parts are colored randomly. Right: images with consistently lower Fréchet Inception Distance (FID) than baseline results.
  • Figure 2: ComboStoc enables better coverage of the whole path space, illustrated here for two-dimensional data samples. (a) The standard linear one-sided interpolant model loses density as it approaches individual data samples; these low-density regions are not well trained and, once sampled, produce low-quality predictions. (b) With ComboStoc, for each pair of source and target sample points, the whole linear subspace that has their connecting segment as its diagonal is sufficiently sampled, leaving fewer poorly trained low-density regions. (c) When the network is trained to predict the velocity $\vb{x}_1 - \vb{z}$ at an off-diagonal sample point $\vb{x}_{\vb{t}}$, a compensation drift ($\vb{v}_{cmpn}$, green vector) is needed to pull the trajectory back to the diagonal.
  • Figure 3: Comparison of image generation quality with respect to training steps. (a) plots the baseline SiT and our model, with DiT for reference; all models are at the XL/2 scale [SiT_Ma2024]. (b) plots settings that use varying degrees of combinatorial stochasticity.
  • Figure 4: Results of image generation at different training steps. Settings with stronger combinatorial sampling produce well-structured images earlier; e.g. see the koala bear faces and cat eyes.
  • Figure 5: Results of structured shape generation under different settings. Semantic parts are colored randomly. Settings that exploit stronger combinatorial stochasticity show better results. In comparison, the insync_none setting, which does not apply ComboStoc, nearly fails to generate meaningful shapes.
  • ...and 25 more figures