Loom: Diffusion-Transformer for Interleaved Generation

Mingcheng Ye; Jiaming Liu; Yiren Song

TL;DR

Loom is a unified diffusion-transformer for interleaved text-image generation across procedural tutorials, style transfer, and compositional tasks. By extending the Bagel backbone with a planning-first strategy, temporal embeddings, sparse historical frame sampling, and entity-anchored control, Loom supports long-horizon multimodal reasoning with improved text-image alignment. The authors curate a 50K interleaved tutorial dataset and report substantial gains over open-source baselines in both quantitative metrics and qualitative assessments. The work advances open-source capabilities for coherent, multi-turn interleaved multimodal generation and provides a scalable framework for real-world instructional and creative workflows.

Abstract

Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.
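The idea of conditioning each frame on a few sampled prior frames plus the global text, instead of the full history, can be illustrated with a minimal sketch. The abstract does not specify the sampling rule, so everything here is an assumption made for illustration: the function names `sample_history` and `build_context`, the parameter `k`, and the policy of always keeping the most recent frame and drawing the rest uniformly at random are hypothetical and are not taken from Loom.

```python
import random


def sample_history(num_prior, k, seed=None):
    """Return up to k sorted indices into the prior frames.

    Assumed policy (not from the paper): always keep the most recent
    frame and draw the remaining k - 1 indices uniformly at random
    from the earlier frames.
    """
    if num_prior <= 0 or k <= 0:
        return []
    if num_prior <= k:
        # History is short enough to use in full.
        return list(range(num_prior))
    rng = random.Random(seed)
    latest = num_prior - 1
    earlier = rng.sample(range(latest), k - 1)
    return sorted(earlier + [latest])


def build_context(global_text, frames, k, seed=None):
    """Pair the global textual context with a sparse subset of prior frames."""
    indices = sample_history(len(frames), k, seed)
    return {"text": global_text, "history": [frames[i] for i in indices]}


# Usage: ten prior frames, but only three are passed on as conditioning.
frames = [f"frame_{i}" for i in range(10)]
context = build_context("global plan text", frames, k=3, seed=0)
print(context["history"])
```

Under this policy the conditioning set never exceeds `k` frames regardless of sequence length, which is the property the abstract credits for efficient long-horizon generation.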
