Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata
TL;DR
Stitch is a training-free method for adding external position control to MMDiT-based T2I models. It decomposes a prompt into object-specific sub-prompts with LLM-generated bounding boxes, constrains early generation to those boxes with Region Binding, and then extracts foreground tokens with Cutout and stitches them together. The approach yields substantial improvements on PosEval, a challenging position-focused extension of GenEval, and achieves state-of-the-art results with several base models without sacrificing image quality. PosEval reveals persistent gaps in complex positional reasoning, while Stitch delivers consistent gains across 2–4 object configurations and across base models including Qwen-Image, FLUX, and SD3.5, which supports the practical value of training-free, bounding-box-conditioned generation for spatial prompts.
Abstract
Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" remains a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. With Qwen-Image, Stitch achieves state-of-the-art results on PosEval, improving over previous models by 54%, while integrating position control into leading models without any training. Code is available at https://github.com/ExplainableML/Stitch.
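The idea of generating objects inside bounding boxes and then stitching them together can be illustrated with a small, self-contained sketch. It is not the authors' implementation. The names `Region`, `box_to_token_mask`, and `stitch_tokens` are hypothetical, the 4x4 token grid is arbitrary, and plain box masks stand in for Cutout, which according to the abstract isolates foreground tokens using targeted attention heads rather than the box itself.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Region:
    """Hypothetical container: one object-specific sub-prompt plus its box."""
    sub_prompt: str
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def box_to_token_mask(box, grid_h: int, grid_w: int) -> List[bool]:
    """Flat row-major mask over a grid_h x grid_w token grid.

    A token is marked True when its center lies inside the box.
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(grid_h):
        cy = (row + 0.5) / grid_h
        for col in range(grid_w):
            cx = (col + 0.5) / grid_w
            mask.append(x0 <= cx < x1 and y0 <= cy < y1)
    return mask


def stitch_tokens(background: Sequence, per_object_tokens, masks) -> list:
    """Copy each object's tokens into the background where its mask is True.

    Assumption: box masks replace the attention-based foreground
    selection that the paper calls Cutout.
    """
    out = list(background)
    for tokens, mask in zip(per_object_tokens, masks):
        for i, inside in enumerate(mask):
            if inside:
                out[i] = tokens[i]
    return out


# Illustrative decomposition of a prompt such as "a dog to the left of a cat".
regions = [
    Region("a photo of a dog", (0.0, 0.0, 0.5, 1.0)),  # left half
    Region("a photo of a cat", (0.5, 0.0, 1.0, 1.0)),  # right half
]
masks = [box_to_token_mask(r.box, 4, 4) for r in regions]

# Strings stand in for latent tokens so the result is easy to inspect.
stitched = stitch_tokens(["bg"] * 16, [["dog"] * 16, ["cat"] * 16], masks)
```

In this toy setup the left two columns of the grid end up holding the "dog" tokens and the right two columns the "cat" tokens. A real pipeline would operate on latent tensors at an intermediate denoising step and continue generation after stitching.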
