Stitch: Training-Free Position Control in Multimodal Diffusion Transformers

Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata

TL;DR

Stitch introduces a training-free method to inject external position control into MMDiT-based T2I models by decomposing prompts into object-specific sub-prompts with LLM-generated bounding boxes, constraining early generation with Region Binding, and extracting and stitching foreground tokens via Cutout. The approach yields substantial improvements on PosEval, a challenging position-focused extension of GenEval, and achieves state-of-the-art results with several base models without sacrificing image quality. PosEval reveals persistent gaps in complex positional reasoning, while Stitch demonstrates robust gains across 2–4 object configurations and various baselines, including Qwen-Image, FLUX, and SD3.5, underscoring the practical value of training-free, bounding-box conditioned generation for spatial prompts in real-world applications.

Abstract

Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at https://github.com/ExplainableML/Stitch.
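To make the idea of generating each object "within designated bounding boxes" more concrete, the following is a minimal sketch of how a bounding box could be turned into an attention constraint over a joint text-and-image token sequence. It is an illustration under stated assumptions, not the authors' implementation: the normalized `(x0, y0, x1, y1)` box format, the row-major latent grid, the text-then-image token ordering, and the specific masking rule (tokens bound to the object attend only among themselves) are all hypothetical choices. The function names are likewise invented for this example.

```python
def box_to_token_mask(box, grid_h, grid_w):
    """Mark latent tokens whose centers fall inside a bounding box.

    Assumption: box = (x0, y0, x1, y1) in normalized [0, 1] coordinates,
    and image tokens are laid out row-major on a grid_h x grid_w grid.
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(grid_h):
        cy = (row + 0.5) / grid_h  # vertical center of this token
        for col in range(grid_w):
            cx = (col + 0.5) / grid_w  # horizontal center of this token
            mask.append(x0 <= cx <= x1 and y0 <= cy <= y1)
    return mask


def region_binding_attention_mask(in_box, num_text_tokens):
    """Build a boolean mask where allowed[q][k] means query q may attend to key k.

    Assumption: the joint sequence is [text tokens, image tokens]. Text tokens
    and in-box image tokens form one group; out-of-box image tokens form
    another. Attention is permitted only within a group, which is one
    plausible way to keep the object prompt from influencing tokens outside
    its box.
    """
    bound = [True] * num_text_tokens + list(in_box)
    n = len(bound)
    return [[bound[q] == bound[k] for k in range(n)] for q in range(n)]
```

In a real MMDiT, a mask of this kind would be passed to the joint attention layers only for the early constrained timesteps, after which generation continues unconstrained.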

Paper Structure

This paper contains 25 sections, 3 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: (a) Stitch boosts position-aware generation, training-free, (b) by generating objects in LLM-made bounding boxes (dashed lines) and using attention heads for tighter latent segmentation mid-generation (filled). (c) Our PosEval benchmark extends GenEval with 5 new positional tasks.
  • Figure 2: Stitch excels at complex positional prompts.
  • Figure 3: Stitch: Multimodal LLM $L$ splits full prompt $P$ into object prompts $\{p_k\}$ and bounding boxes $\{b_k\}$, along with full-image background prompt $p_0$. MMDiT-based model $F$ separately sketches objects and background (butterfly, skateboard, park) for $S$ timesteps. With Region Binding attention-masking constraints, $F$ generates each $p_k$ in $b_k$. In Cutout, the highest attention weights in a targeted head select tighter latent regions linked to foreground objects $\mathcal{Z}_{v,k}^{S}$, which are merged with the background latents $\mathcal{Z}_{v,0}^{S}$ to form composite latent $\mathcal{C}$. For the remaining steps, the unconstrained $F$ seamlessly stitches the sketches into a coherent image conditioned on the full prompt.
  • Figure 4: Stitch corrects Qwen-Image (QwenI) and FLUX position on PosEval without quality loss.
  • Figure 5: Cutout cleanly extracts objects mid-generation.
  • ...and 10 more figures
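
The Cutout and merging steps described in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: the caption says the highest attention weights in a targeted head select the foreground region, but it does not say how "highest" is determined, so the `keep_fraction` top-k rule here is an assumption. The function names, the flat per-token list layout, and the rule that later objects overwrite earlier ones are also hypothetical.

```python
def cutout_mask(attn, in_box, keep_fraction=0.5):
    """Select the in-box tokens with the highest attention weights.

    attn:   one attention weight per image token (assumed to come from a
            single targeted head, for the object's prompt).
    in_box: booleans marking tokens inside the object's bounding box.
    Assumption: keep the top `keep_fraction` of in-box tokens (at least one).
    """
    candidates = [i for i, inside in enumerate(in_box) if inside]
    k = max(1, int(keep_fraction * len(candidates)))
    top = sorted(candidates, key=lambda i: attn[i], reverse=True)[:k]
    selected = set(top)
    return [i in selected for i in range(len(attn))]


def composite(background, objects):
    """Merge foreground object latents into the background latents.

    background: list of per-token latent vectors (the p_0 sketch).
    objects:    list of (latents, mask) pairs, one per object prompt p_k.
    Returns a new list playing the role of the composite latent C;
    later objects overwrite earlier ones where masks overlap (assumption).
    """
    out = [list(v) for v in background]
    for latents, mask in objects:
        for i, keep in enumerate(mask):
            if keep:
                out[i] = list(latents[i])
    return out
```

The composite would then serve as the starting latent for the remaining unconstrained denoising steps conditioned on the full prompt $P$.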