Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

TL;DR

Canvas-to-Image addresses multimodal, compositional image generation by introducing a unified Multi-Task Canvas that encodes diverse controls into a single RGB input for a Vision-Language Model (VLM) and diffusion backbone. The model is trained on single-control canvases (Spatial, Pose, Box), one type sampled per training step, with a task-aware flow-matching loss ${\mathcal{L}}_{\text{flow}}$, and multi-control reasoning emerges at inference time. The approach improves identity preservation and control adherence across four benchmarks (multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation) and handles multi-control compositions without task-specific retraining. This offers a scalable route to multimodal design tools that accept coherent, flexible guidance across subjects, poses, and spatial layouts.

Abstract

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
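
The training recipe summarized above (one canvas type per step, with a flow-matching objective) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The linear interpolation path, the uniform task sampling, and the names `sample_task`, `flow_matching_pair`, `flow_loss`, `toy_velocity_model`, and `training_step` are all assumptions made for illustration, and the toy model merely stands in for the MM-DiT.

```python
import numpy as np

# Canvas types named in the paper; sampling them uniformly is an assumption.
TASKS = ("spatial", "pose", "box")


def sample_task(rng):
    """Pick one canvas type for this training step."""
    return TASKS[rng.integers(len(TASKS))]


def flow_matching_pair(x0, noise, t):
    """Build the noisy input and velocity target for one sample.

    Assumes the common linear path x_t = (1 - t) * x0 + t * noise,
    whose velocity is noise - x0. The paper's exact convention is
    not stated in this summary.
    """
    x_t = (1.0 - t) * x0 + t * noise
    v_target = noise - x0
    return x_t, v_target


def flow_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))


def toy_velocity_model(x_t, t, task):
    """Hypothetical stand-in for the MM-DiT: always predicts zero velocity."""
    return np.zeros_like(x_t)


def training_step(x0, rng, model=toy_velocity_model):
    """One illustrative step: sample a task, a time, and noise; return the loss."""
    task = sample_task(rng)
    t = rng.uniform(0.0, 1.0)
    noise = rng.standard_normal(x0.shape)
    x_t, v_target = flow_matching_pair(x0, noise, t)
    v_pred = model(x_t, t, task)  # the task tag is passed as a conditioning hint
    return task, flow_loss(v_pred, v_target)
```

Passing the task name to the model is only one way to make the objective task-aware. In a real system the clean latent `x0` would come from a VAE encoder and the conditioning from VLM embeddings of the canvas and prompt, as the Figure 2 caption describes.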

Paper Structure

This paper contains 32 sections, 1 equation, 19 figures, 11 tables.

Figures (19)

  • Figure 1: Canvas-to-Image enables compositional control for text-to-image generation through a unified Multi-Task Canvas framework. The canvas serves as a flexible visual interface that guides image synthesis by supporting diverse guiding signals, including spatially positioned subjects, pose signals, bounding boxes, and text annotations.
  • Figure 2: Overview of the Canvas-to-Image framework. (a) Multi-Task Canvas Training. We reformulate heterogeneous control tasks (spatial composition, pose guidance, and layout-constrained generation) into a single canvas-to-image formulation. Each training step samples one type of canvas (Spatial, Pose, or Box), where the target frame serves as supervision. All control signals are encoded as RGB canvases interpretable by the Vision-Language Model (VLM) for unified visual–spatial reasoning. The Multi-Modal DiT (MM-DiT) receives VLM embeddings, VAE latents, and noisy latents to predict the velocity for flow matching. (b) Inference. Although trained on single-control samples, the model generalizes to multi-control compositions, jointly leveraging pose, layout, and reference cues within a single generation process. This enables coherent multi-control reasoning without task-specific retraining.
  • Figure 3: Qualitative Comparisons on the 4P Composition Benchmark. Under the Spatial Canvas setup, our Canvas-to-Image achieves the highest identity preservation for multi-subject insertion while respecting the spatial placement of each subject segment. The FLUX Kontext [fluxkontext]-based approach [ilkerzgi2025overlay] fails to preserve identity, whereas Nano Banana [comanici2025gemini] consistently exhibits copy-pasting artifacts. Compared to our base model, Qwen-Image-Edit [wu2025qwen], our method maintains similar image quality but demonstrates significantly stronger identity preservation.
  • Figure 4: Qualitative Comparisons on the Pose-Overlaid 4P Composition Benchmark. Our Canvas-to-Image achieves the highest identity preservation and most accurate pose alignment. Note how Canvas-to-Image closely follows the target poses defined in the prior generated by FLUX-Dev [FLUX] ("Pose Prior" column), while maintaining subject identities more faithfully than the baselines.
  • Figure 5: Qualitative Comparisons on the Layout-Guided Composition Benchmark. Under the Box Canvas setup, our Canvas-to-Image achieves the highest fidelity in spatial layout control, even compared to the state-of-the-art CreatiDesign [zhang2025creatidesign] model trained for this task. Nano Banana [comanici2025gemini], while demonstrating good image quality, does not adhere to the bounding boxes as closely as our model. Compared to our base model, Qwen-Image-Edit [wu2025qwen], we achieve the same level of image quality but significantly stronger spatial condition alignment.
  • ...and 14 more figures
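
The captions above describe controls being encoded as a single RGB canvas: subject crops placed at spatial positions, and bounding boxes drawn for layout. A rough NumPy sketch of that idea follows. It is an illustration only, not the paper's canvas construction. The helper names `blank_canvas`, `paste_crop`, and `draw_box`, the white background, and the red outline are assumptions. Text annotations inside boxes are omitted because rendering them would need a font library.

```python
import numpy as np


def blank_canvas(height, width, value=255):
    """Create an RGB canvas filled with a constant value (white by default)."""
    return np.full((height, width, 3), value, dtype=np.uint8)


def paste_crop(canvas, crop, top, left):
    """Paste an RGB subject crop at (top, left), clipping at the canvas borders."""
    h, w = crop.shape[:2]
    H, W = canvas.shape[:2]
    y0, x0 = max(top, 0), max(left, 0)
    y1, x1 = min(top + h, H), min(left + w, W)
    if y0 >= y1 or x0 >= x1:
        return canvas  # crop lies entirely outside the canvas
    canvas[y0:y1, x0:x1] = crop[y0 - top:y1 - top, x0 - left:x1 - left]
    return canvas


def draw_box(canvas, top, left, bottom, right, color=(255, 0, 0), thickness=2):
    """Draw a rectangle outline; bottom and right are exclusive.

    Assumes the box lies inside the canvas.
    """
    c = np.array(color, dtype=np.uint8)
    canvas[top:top + thickness, left:right] = c        # top edge
    canvas[bottom - thickness:bottom, left:right] = c  # bottom edge
    canvas[top:bottom, left:left + thickness] = c      # left edge
    canvas[top:bottom, right - thickness:right] = c    # right edge
    return canvas
```

For example, `draw_box(paste_crop(blank_canvas(64, 64), crop, 8, 8), 30, 30, 50, 50)` yields one image that carries both a placed subject and a layout box, which is the kind of composite input the multi-control inference setting relies on.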