The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen

TL;DR

WorldCanvas addresses the need for controllable, semantically rich world-event generation by integrating trajectories, reference images, and text. It introduces a data-curation pipeline to produce trajectory–reference–text triplets and a multimodal model that injects trajectory signals and employs spatial-aware cross-attention to bind motion with captions and visual grounding. Across qualitative and quantitative evaluations, it achieves coherent, memory-like consistency and superior alignment with user inputs compared to state-of-the-art baselines. This work advances interactive world models toward user-shaped simulation and enables more reliable, grounded, and controllable scene synthesis.

Abstract

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
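To make the three input modalities concrete, here is a minimal sketch of how one promptable event might be represented. It is an illustration only, not the authors' data format: the names `TrajectoryPoint`, `EventPrompt`, and `visible_spans`, the normalized coordinates, and the file-path reference field are all assumptions. The sketch pairs a caption with a trajectory whose points carry position, frame index, and a visibility flag, then recovers the frame ranges in which the subject is on screen, which is one way to read off entry and exit events.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TrajectoryPoint:
    frame: int      # timing: index of the video frame for this point
    x: float        # motion: horizontal position, assumed normalized to [0, 1]
    y: float        # motion: vertical position, assumed normalized to [0, 1]
    visible: bool   # visibility: False while the subject is occluded or off screen


@dataclass
class EventPrompt:
    caption: str                            # semantic intent bound to this trajectory
    points: List[TrajectoryPoint]           # one subject's path through the clip
    reference_image: Optional[str] = None   # hypothetical path to an identity reference


def visible_spans(prompt: EventPrompt) -> List[Tuple[int, int]]:
    """Return inclusive (first_frame, last_frame) runs where the subject is visible."""
    spans: List[Tuple[int, int]] = []
    start = None  # first frame of the current visible run, if any
    prev = None   # most recent visible frame in the current run
    for p in sorted(prompt.points, key=lambda q: q.frame):
        if p.visible:
            if start is None:
                start = p.frame
            prev = p.frame
        elif start is not None:
            spans.append((start, prev))
            start = None
    if start is not None:
        spans.append((start, prev))
    return spans


# Example: a subject that is hidden for one frame and then reappears.
dog = EventPrompt(
    caption="a dog runs behind the tree and comes out the other side",
    points=[
        TrajectoryPoint(0, 0.10, 0.60, True),
        TrajectoryPoint(1, 0.30, 0.60, True),
        TrajectoryPoint(2, 0.50, 0.60, False),
        TrajectoryPoint(3, 0.70, 0.60, True),
    ],
)
print(visible_spans(dog))  # [(0, 1), (3, 3)]
```

Keeping visibility as a per-point flag, rather than dropping hidden points, preserves the position of the subject while it is out of view, which is the kind of information a model would need to keep identity consistent across a temporary disappearance.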

Paper Structure

This paper contains 25 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: The architecture of our WorldCanvas. The data pipeline generates high-quality trajectory–reference–text triplets (in the figure, gray boxes denote reference images extracted from the video, and hollow circles along trajectories indicate invisible points due to occlusion or rotation). The Spatial-Aware Weighted Cross-Attention mechanism explicitly aligns each caption with its associated trajectory.
  • Figure 2: Qualitative comparison on promptable world event modeling. Our model generates results that align with the given trajectories, text prompts, and reference images, whereas the baselines fail to follow these inputs.
  • Figure 3: Qualitative comparison of multi-subject trajectory-text alignment. Our method accurately aligns the textual descriptions with motions specified by trajectories, whereas the baselines fail to produce correct results in such cases.
  • Figure 4: Consistency maintenance results. The examples shown illustrate object, scene, and character consistency preservation, respectively.
  • Figure 5: Qualitative results for our ablation study. The variant without Spatial-Aware Weighted Cross-Attention causes severe semantic-action misalignment, while the Hard Cross-Attention variant yields incomplete semantics. In contrast, our method achieves accurate alignment between semantic content and trajectories.
  • ...and 3 more figures