ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction

Hyungjin Chung; Dohun Lee; Jong Chul Ye

TL;DR

This work introduces Autoregressive Coherent multimodal generation with Diffusion Correction (ACDC), a zero-shot approach that combines the strengths of autoregressive models (ARMs) and diffusion models (DMs) at inference time, without additional fine-tuning. It enhances output quality while remaining agnostic to the specific ARM and DM architectures.

Abstract

Autoregressive models (ARMs) and diffusion models (DMs) represent two leading paradigms in generative modeling, each excelling in distinct areas: ARMs in global context modeling and long-sequence generation, and DMs in generating high-quality local contexts, especially for continuous data such as images and short videos. However, ARMs often suffer from exponential error accumulation over long sequences, leading to physically implausible results, while DMs are limited by their local context generation capabilities. In this work, we introduce Autoregressive Coherent multimodal generation with Diffusion Correction (ACDC), a zero-shot approach that combines the strengths of both ARMs and DMs at the inference stage without the need for additional fine-tuning. ACDC leverages ARMs for global context generation and memory-conditioned DMs for local correction, ensuring high-quality outputs by correcting artifacts in generated multimodal tokens. In particular, we propose a memory module based on large language models (LLMs) that dynamically adjusts the conditioning texts for the DMs, preserving crucial global context information. Our experiments on multimodal tasks, including coherent multi-frame story generation and autoregressive video generation, demonstrate that ACDC effectively mitigates the accumulation of errors and significantly enhances the quality of generated outputs, achieving superior performance while remaining agnostic to specific ARM and DM architectures. Project page: https://acdc2025.github.io/
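The generate-then-correct loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a noise-then-denoise (SDEdit-style) correction and assumes the corrected frame is fed back to the ARM, neither of which the abstract spells out. The names `Memory`, `diffusion_correct`, `acdc_generate`, `arm_step`, and `denoise_step` are hypothetical, and the ARM, the DM, and the LLM memory module are replaced by toy stand-ins so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List

import numpy as np


@dataclass
class Memory:
    """Hypothetical stand-in for the LLM memory module: it stores past prompts
    and folds recent global context into each local conditioning text."""
    history: List[str] = field(default_factory=list)

    def rewrite(self, local_prompt: str) -> str:
        self.history.append(local_prompt)
        # A real module would query an LLM; here we simply join recent prompts.
        return " | ".join(self.history[-3:])


def diffusion_correct(frame, prompt, denoise_step, t0=0.4, num_steps=10, rng=None):
    """Assumed correction: perturb the frame to noise level t0, then apply
    prompt-conditioned reverse steps back toward t = 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.sqrt(1.0 - t0) * frame + np.sqrt(t0) * rng.standard_normal(frame.shape)
    for t in np.linspace(t0, 0.0, num_steps, endpoint=False):
        x = denoise_step(x, t, prompt)
    return x


def acdc_generate(arm_step: Callable, denoise_step: Callable,
                  prompts: List[str], init_frame: np.ndarray) -> List[np.ndarray]:
    """ARM proposes the next frame (global context); the DM corrects it locally."""
    memory = Memory()
    context = [init_frame]
    frames = []
    for prompt in prompts:
        raw = arm_step(context, prompt)             # may carry accumulated error
        cond = memory.rewrite(prompt)               # memory-adjusted conditioning text
        fixed = diffusion_correct(raw, cond, denoise_step)
        frames.append(fixed)
        context.append(fixed)                       # assumption: corrected frame is fed back
    return frames


# Toy stand-ins: the "ARM" drifts by a constant, the "DM" pulls toward zero.
toy_arm = lambda ctx, prompt: ctx[-1] + 0.5
toy_denoise = lambda x, t, prompt: 0.5 * x

frames = acdc_generate(toy_arm, toy_denoise,
                       prompts=["scene 1", "scene 2", "scene 3"],
                       init_frame=np.zeros(4))
```

In this toy setting the uncorrected ARM would drift to 1.5 after three steps, whereas the corrected frames stay near zero because each one is pulled back before being fed to the next step.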

Paper Structure

This paper contains 31 sections, 3 theorems, 32 equations, 10 figures, 7 tables, 1 algorithm.

Introduction
Related works
Unifying autoregressive models with diffusion models
Diffusion models as vision decoder for autoregressive models
Proposals in unified architecture
Diffusion models for image/video sequence generation
Diffusion models as world models
Diffusion models as long sequence generators
ADC: Autoregressive modeling with diffusion correction
World model for story generation
Autoregressive multimodal modeling
Diffusion correction
Large Language Models as memory module
Incorporating physical constraints
Extension to long video generation
...and 16 more sections

Key Result

theorem 1

The KL divergence between $p_t$ and $q_t$ monotonically decreases through forward diffusion, i.e. $\frac{\partial D_{KL}(p_t \,\|\, q_t)}{\partial t} \le 0$.
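As a numerical illustration of the monotonicity statement in Theorem 1, the sketch below tracks the closed-form KL divergence between two 1-D Gaussians as both are pushed through the same forward diffusion. The Gaussians, the variance-preserving perturbation $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, and the linear schedule for $\bar\alpha_t$ are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def gaussian_kl(m_p, v_p, m_q, v_q):
    """KL( N(m_p, v_p) || N(m_q, v_q) ) for 1-D Gaussians, with v = variance."""
    return 0.5 * (math.log(v_q / v_p) + (v_p + (m_p - m_q) ** 2) / v_q - 1.0)


def diffuse(mean, var, alpha_bar):
    """Marginal of x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar) * mean, alpha_bar * var + (1.0 - alpha_bar)


def kl_along_forward_diffusion(p0, q0, num_steps=50):
    """KL(p_t || q_t) on a grid where alpha_bar falls linearly from 1 to 0."""
    kls = []
    for i in range(num_steps + 1):
        alpha_bar = 1.0 - i / num_steps
        m_p, v_p = diffuse(*p0, alpha_bar)
        m_q, v_q = diffuse(*q0, alpha_bar)
        kls.append(gaussian_kl(m_p, v_p, m_q, v_q))
    return kls


# p_0 = N(0, 1) and q_0 = N(2, 0.25), each given as (mean, variance).
kls = kl_along_forward_diffusion(p0=(0.0, 1.0), q0=(2.0, 0.25))
```

The list `kls` is non-increasing and reaches zero at the final step, where both marginals have become the standard normal.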

Figures (10)

Figure 1: Comparison between a standard multimodal ARM and its ADC corrected version. Rows 1-4: story generation; rows 5-6: long video generation. Prompts are provided in the appendix.
Figure 2: Illustration of the proposed ADC method.
Figure 3: Before (left) and after (right) correction through the proposed LLM memory module. Key global context is distilled into the local prompts.
Figure 4: Qualitative comparison of the story generation task.
Figure 5: Incorporating user constraints to correct for physical errors in the generated image frames.
...and 5 more figures

Theorems & Definitions (5)

Theorem 1 (Nie et al., 2022)
Theorem 2
Proof
Theorem 3
Proof
