iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin

TL;DR

iMontage reuses a pretrained video diffusion backbone to enable highly dynamic, many-to-many image generation with temporal coherence preserved. It introduces a Marginal RoPE-based temporal embedding and a comprehensive data pipeline plus a three-stage, multi-task training regime to unify one-to-one editing, many-to-one generation, and many-to-many storyboard tasks. The approach achieves state-of-the-art or competitive results across editing benchmarks, OmniContext, and storyboard evaluations, and is supported by extensive ablations validating RoPE design and training strategies. While capable across diverse tasks, it acknowledges limitations in long-context handling and proposes future scaling and data enhancements.

Abstract

Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.

Paper Structure

This paper contains 25 sections, 2 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: iMontage flexibly handles many input images and generates many output images with high consistency. We use three different colors to represent three settings. Images in dotted-line boxes are the inputs.
  • Figure 2: Overview of iMontage. The model accepts a flexible set of reference images and produces N outputs conditioned on a text prompt. Each image is encoded separately by a 3D VAE, text is encoded by a language model, and both token streams are processed by an MMDiT. We concatenate clean reference-image tokens with noisy target tokens before denoising. Right: training uses fixed-length text tokens and variable-length image/noise tokens, and the blocks transition from dual-stream to single-stream. For the image branch, we apply Marginal RoPE, a head–tail temporal indexing that separates input and output pseudo-frames, preserves spatial RoPE, and supports many-to-many generation. In the figure, H and W with subscripts denote the height and width indices of the 2D RoPE computed at the image's native resolution, while T denotes the time index assigned along the temporal dimension.
  • Figure 3: Overview of our dataset: Our dataset is constructed from four sources and is organized into two stages, comprising high-quality foundational data and multiple task-oriented subsets.
  • Figure 4: Comparison with three baselines in the storyboard generation setting. Single-character and multi-character samples are presented.
  • Figure 5: Ablation on different RoPE strategies. We evaluate on a low-resolution subset of the editing data, training each strategy for the same number of steps. In the figure, corner numbers indicate provenance: 1 original input, 2 edited ground truth, 3 output from Marginal RoPE, and 4 output from Even RoPE.
  • ...and 14 more figures
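
The Figure 2 caption describes Marginal RoPE as a head–tail temporal indexing that keeps input and output pseudo-frames apart. The snippet below is a minimal sketch of one way such an assignment could look, not the authors' implementation: it assumes a fixed temporal index range whose size (`max_index`) is a hypothetical parameter, places input images at the head of that range and output images at the tail, and leaves the spatial (H, W) indices untouched. The function name and the default range size are illustrative only.

```python
from typing import List, Tuple


def marginal_temporal_indices(
    num_inputs: int,
    num_outputs: int,
    max_index: int = 32,  # hypothetical size of the temporal index range
) -> Tuple[List[int], List[int]]:
    """Assign head indices to input images and tail indices to output images.

    Illustrative sketch of a head-tail layout: inputs occupy
    0 .. num_inputs-1, outputs occupy the last num_outputs slots,
    and any remaining slots in the middle stay unused.
    """
    if num_inputs < 0 or num_outputs < 0:
        raise ValueError("image counts must be non-negative")
    if num_inputs + num_outputs > max_index:
        raise ValueError("not enough temporal slots for all images")

    # Inputs are packed at the head of the range.
    input_indices = list(range(num_inputs))
    # Outputs are packed at the tail of the range.
    output_indices = list(range(max_index - num_outputs, max_index))
    return input_indices, output_indices


if __name__ == "__main__":
    ins, outs = marginal_temporal_indices(num_inputs=2, num_outputs=3)
    print("input time indices:", ins)
    print("output time indices:", outs)
```

Under this assumed layout, the two groups never share a temporal index as long as the counts fit in the range, which is the separation property the caption attributes to Marginal RoPE. The Even RoPE baseline named in Figure 5 is not sketched here because the source does not describe its layout.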