Loom: Diffusion-Transformer for Interleaved Generation

Mingcheng Ye, Jiaming Liu, Yiren Song

TL;DR

Loom proposes a unified diffusion-transformer that enables interleaved text–image generation across procedural tutorials, style transfer, and compositional tasks. By extending the Bagel backbone with a planning-first strategy, temporal embeddings, sparse historical frame sampling, and entity-anchored control, Loom achieves long-horizon, multi-modal reasoning with improved text-image alignment. The authors curate a 50K interleaved tutorial dataset and demonstrate substantial gains over open-source baselines on both quantitative metrics and qualitative assessments. This work advances open-source capabilities for coherent, multi-turn interleaved multimodal generation and provides a scalable framework for real-world instructional and creative workflows.

Abstract

Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.
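The sparse history conditioning described in the abstract can be illustrated with a minimal sketch. The abstract does not state how prior frames are selected, so the policy below (always keep the most recent frame, then sample the rest uniformly) is an assumption, and the names `sample_history` and `k` are hypothetical rather than taken from Loom.

```python
import random
from typing import List, Optional


def sample_history(step: int, k: int = 2,
                   rng: Optional[random.Random] = None) -> List[int]:
    """Return at most k indices of prior frames (0 .. step-1) to condition on.

    Assumed policy (not specified by the source): keep the latest frame
    and draw the remaining k-1 indices uniformly from the older frames.
    """
    if step <= 0 or k <= 0:
        return []  # the first frame has no history to condition on
    prior = list(range(step))
    if len(prior) <= k:
        return prior  # history is short enough to use in full
    rng = rng or random.Random()
    latest = prior[-1]
    older = rng.sample(prior[:-1], k - 1)  # sampling without replacement
    return sorted(older + [latest])


# Usage: choose the conditioning frames for step 10 with a budget of 3.
history = sample_history(step=10, k=3, rng=random.Random(0))
```

Under this assumption the number of conditioning frames is capped at `k` regardless of sequence length, which is the property the abstract contrasts with concatenating all history.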

Paper Structure

This paper contains 44 sections, 6 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Showcases of Loom's interleaved text-image generation. Interleaved input tasks involve composing reference images into one scene or performing style transfer. Interleaved output tasks produce text–image sequences from a prompt, such as cooking tutorials or drawing guides.
  • Figure 2: (a) An interleaved input paradigm with various conditional images and text prompts, producing a single image output. (b) Interleaved output paradigm, where the model takes either pure text instructions or mixed text-image guidance and generates multi-round, sequential text-image pairs. (c) Training and inference architecture for the interleaved output paradigm, focusing on case (1) where the input is a text-image guidance sequence and the output is continuous text-image pairs, exemplified by a step-by-step drawing tutorial. The pipeline contains a condition branch, which encodes sparse historical frames via ViT and VAE encoders to provide visual context, and a generation branch, which produces both full-step textual descriptions and the next image under the attention mask, ensuring alignment between global textual planning and incremental image rendering.
  • Figure 3: Interleaved dataset construction: (a) Blogs and videos are collected, and 4–6 frames are uniformly sampled, manually verified, and captioned by GPT‑4o with generation prompts and stepwise captions. (b) Multi‑image composition combines models, objects, and scenes via Nano‑Banana with textual descriptions. (c) Style transfer uses images from the Promptsref website; Nano‑Banana performs de‑stylization to obtain realistic images and corresponding prompts.
  • Figure 4: More generation results of Loom in interleaved tasks. The top rows show interleaved output tasks, including text-to-interleaved cooking tutorials and image-to-interleaved painting tutorials. The bottom rows depict interleaved input tasks, covering (1) procedural generation, (2) compositional generation and decomposition, and (3) style transfer from a given reference image.
  • Figure 5: Comparison results. (a) Unified models such as Bagel and Janus-Pro only support single-round input–output generation. (b) Interleaved models, including Anole and Doubao, support multi-step text–image generation; we also compare the closed-source Doubao with Loom on both text-to-interleaved and image-to-interleaved tasks.
  • ...and 6 more figures