How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu

TL;DR

This work proposes UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall: it dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings.

Abstract

Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
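
To make the curation idea concrete, here is a minimal sketch of relevance-based pruning of image history. It is an illustration under stated assumptions, not the paper's implementation: the function `curate_image_history`, the per-image relevance scores, and the fixed `keep` budget are all hypothetical, and the abstract does not specify how UniLongGen computes or thresholds its internal rankings.

```python
from typing import Dict, List


def curate_image_history(relevance: Dict[int, float], keep: int) -> List[int]:
    """Return the indices of the `keep` most relevant past images.

    `relevance` maps an image index to a score (assumed here to be something
    like the attention mass the current query places on that image's tokens).
    The result is returned in original sequence order so the retained
    context stays chronologically consistent.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Rank image indices from most to least relevant.
    ranked = sorted(relevance, key=lambda idx: relevance[idx], reverse=True)
    # Keep the top entries, then restore chronological order.
    return sorted(ranked[:keep])


# Hypothetical scores for five previously generated images.
scores = {0: 0.30, 1: 0.02, 2: 0.25, 3: 0.01, 4: 0.40}
kept = curate_image_history(scores, keep=3)
print(kept)  # [0, 2, 4]
```

In this toy setting, images 1 and 3 receive almost no relevance and are dropped, so later generation would condition on three images instead of five. A real system would act on cached keys and values rather than on a dictionary of scores.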

Paper Structure (75 sections, 5 equations, 28 figures, 5 tables)

Figures (28)

  • Figure 1: UniLongGen enables long-horizon interleaved image generation in a single unified sequence. It generates over 40 images while maintaining high visual quality and cross-image consistency.
  • Figure 2: Generation quality degrades as sequence length increases ($1024{\times}576$). Left: Normalized scores for four metrics (Consistency, Quality, HPS v3, PickScore) across a 40-image sequence, averaged over multiple runs (shaded bands indicate variance). All metrics remain relatively stable during the first $\sim$20 images, then undergo a sharp collapse. Right: Representative samples at different positions. Early shots exhibit high visual fidelity and coherent scene composition, whereas later shots deteriorate into severe artifacts, structural distortions, and ultimately unrecognizable outputs.
  • Figure 3: Bottleneck: Generation quality depends on image count, not token length. We visualize generation quality (HPS v3, color scale) for long-horizon sequences with different tokens-per-image rates ($1024{\times}1024$, $1024{\times}576$, $768{\times}432$). Despite wide disparities in cumulative token count (y-axis), all settings exhibit quality collapse within the same image-index window (20--25 images, shaded region). Horizontal comparisons (dashed gray lines) confirm that matching the token budget does not predict success; instead, the number of semantic events (image index) is the dominant bottleneck.
  • Figure 4: Token-matched text vs. image history induces distinct attention regimes and failure modes. We compare (A) a 120K-token text-only history and (B) a token-matched image-heavy history (22 images). Top row (diagnostics): similarity-score distributions from current-image queries to historical tokens show that text history produces a consistently concentrated score spectrum, while image-heavy history exhibits much higher variance and a heavier tail (outliers). A concentration plot (cumulative attention vs. top-% tokens) further indicates that attention under image history is more top-heavy (higher Gini). Bottom row (outcomes): token-matched text history mainly leads to passive dilution (blurred, under-conditioned generations), whereas token-matched image history causes active pollution (artifacts, speckle, and structural distortions).
  • Figure 5: Attention becomes diluted and unfocused as context grows. Attention entropy rises with context length, indicating the model becomes increasingly "confused" about where to attend.
  • ...and 23 more figures
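
The captions for Figures 4 and 5 refer to two attention diagnostics: the Gini coefficient (how top-heavy the attention distribution is) and attention entropy (how spread out it is). The sketch below shows one standard way to compute both for a single query's attention weights. The function names and the toy weight vectors are assumptions for illustration; the captions do not state exactly how the paper computes these quantities.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of non-negative attention weights."""
    total = sum(weights)
    # Normalize to a probability distribution and skip zero entries.
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def gini(weights: Sequence[float]) -> float:
    """Gini coefficient: 0 for uniform weights, larger when top-heavy."""
    xs = sorted(weights)
    n = len(xs)
    total = sum(xs)
    # Standard rank-weighted formula with ranks 1..n in ascending order.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]   # attention dominated by one token

print(attention_entropy(uniform), gini(uniform))
print(attention_entropy(peaked), gini(peaked))
```

For perfectly uniform attention over n tokens the entropy equals log n, so it grows as the context gets longer, which matches the trend described for Figure 5. A top-heavy distribution has lower entropy and a higher Gini coefficient.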