Improving Diffusion-Based Image Synthesis with Context Prediction

Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui

TL;DR

Question: The paper addresses a limitation of diffusion models: point-wise reconstruction can ignore local neighborhood context during generation. Method: It introduces ConPreDiff, a training-time context prediction module in which each point predicts its multi-stride neighborhood context, decoded as a neighborhood distribution under a Wasserstein-distance objective, without adding inference-time parameters. Contributions: (i) the first context-prediction framework for diffusion-based synthesis, (ii) efficient large-context decoding with a Wasserstein-based objective, and (iii) consistent improvements across text-to-image generation, image inpainting, and unconditional generation, with SOTA text-to-image results on MS-COCO. Impact: The approach improves local semantic continuity and fidelity in diffusion outputs and applies to both discrete and continuous backbones.
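To make the Wasserstein-based objective mentioned above more concrete, here is a minimal sketch of the 2-Wasserstein distance between two equal-size point sets with uniform weights. It is an illustration only, not the paper's decoder: the function names `sq_dist` and `wasserstein2` are hypothetical, and the brute-force search over matchings is an assumption chosen for clarity that is only practical for very small sets.

```python
from itertools import permutations
from math import sqrt
from typing import Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def wasserstein2(xs: Sequence[Point], ys: Sequence[Point]) -> float:
    """2-Wasserstein distance between two uniform empirical distributions.

    For equal-size sets with uniform weights, an optimal transport plan can be
    taken to be a one-to-one matching, so we search all permutations.
    Hypothetical helper for illustration; cost grows factorially with len(xs).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal size")
    n = len(xs)
    best = min(
        sum(sq_dist(xs[i], ys[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return sqrt(best / n)
```

Because the distance compares the two sets as distributions, it does not depend on the order in which neighbors are listed: `wasserstein2([(0.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (0.0, 0.0)])` is zero even though the points appear in a different order.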

Abstract

Diffusion models are a new class of generative models that have dramatically advanced image generation, delivering unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we propose ConPreDiff, the first approach to improve diffusion-based image synthesis with context prediction. We explicitly encourage each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of the diffusion denoising blocks in the training stage, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with its neighborhood context. This new paradigm generalizes to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in the sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation, and image inpainting tasks. ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
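The training-time idea in the abstract can be sketched as a point-wise reconstruction loss plus a weighted context term that is dropped at inference. The following is a toy sketch under stated assumptions, not the authors' implementation: the names `neighborhood_offsets`, `point_loss`, `context_loss`, and `training_loss`, the 8-neighbor layout per stride, the squared-error context term, and the `decode` callable standing in for the context decoder are all hypothetical.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]
# Hypothetical stand-in for the context decoder: maps a point's value and a
# neighbor offset (dy, dx) to a guess for that neighbor's value.
Decoder = Callable[[float, int, int], float]


def neighborhood_offsets(max_stride: int) -> List[Tuple[int, int]]:
    """Offsets of the 8 neighbors at each stride 1..max_stride (assumed layout)."""
    return [
        (dy, dx)
        for s in range(1, max_stride + 1)
        for dy in (-s, 0, s)
        for dx in (-s, 0, s)
        if (dy, dx) != (0, 0)
    ]


def point_loss(pred: Grid, target: Grid) -> float:
    """Mean squared error over all points (the usual point-wise constraint)."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def context_loss(pred: Grid, target: Grid, decode: Decoder, max_stride: int) -> float:
    """Mean squared error between decoded neighbor guesses and true neighbors."""
    h, w = len(pred), len(pred[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            for dy, dx in neighborhood_offsets(max_stride):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:  # skip out-of-bounds neighbors
                    total += (decode(pred[y][x], dy, dx) - target[ny][nx]) ** 2
                    count += 1
    return total / count if count else 0.0


def training_loss(pred: Grid, target: Grid, decode: Decoder,
                  max_stride: int, weight: float) -> float:
    """Training objective: point-wise term plus weighted context term.

    At sampling time only the denoiser output `pred` is needed; `decode` is
    never called, so it adds no inference-time parameters.
    """
    return point_loss(pred, target) + weight * context_loss(pred, target, decode, max_stride)
```

A usage example with an identity decoder (each point guesses that its neighbors equal itself): `training_loss([[0.0, 0.0]], [[0.0, 2.0]], lambda v, dy, dx: v, 1, 0.5)` combines a point-wise error of 2.0 with a context error of 2.0 weighted by 0.5.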

Paper Structure (31 sections, 1 theorem, 17 equations, 6 figures, 4 tables)

Key Result

Theorem 4.1

For any $\epsilon>0$, if the support of the distribution $\mathcal{P}_v^{(i)}$ is confined to a bounded space of $\mathbb{R}^d$, there exists a FNN $u(\cdot):\mathbb{R}^d\rightarrow \mathbb{R}$ (and thus its gradient $\nabla u(\cdot):\mathbb{R}^d\rightarrow \mathbb{R}^d$) with sufficiently large wid

Figures (6)

  • Figure 1: In the training stage, ConPreDiff first performs self-denoising as in standard diffusion models, then conducts neighborhood context prediction based on the denoised point $\bm{x}^i_{t-1}$. In the inference stage, ConPreDiff uses only its self-denoising network for sampling.
  • Figure 2: Synthesis examples demonstrating text-to-image capabilities for various text prompts with LDM, Imagen, and ConPreDiff (Ours). Our model can better express the local contexts and semantics of the texts marked in blue.
  • Figure 3: Inpainting examples generated by our ConPreDiff.
  • Figure 4: Bars denote FID and the line denotes time cost.
  • Figure 5: Equip diffusion models with our context prediction.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 4.1