In-Video Instructions: Visual Signals as Generative Control

Gongfan Fang; Xinyin Ma; Xinchao Wang

TL;DR

This work tackles controllable image-to-video generation by enabling zero-shot control through visual signals embedded directly in the initial frame. It introduces In-Video Instruction, which uses two simple primitives (overlaid text and arrows) to create explicit, spatially grounded correspondences between subjects and actions, without retraining or architectural changes. The method is evaluated on several state-of-the-art generators, where it shows reliable interpretation of embedded instructions, superior localization in multi-object scenes, and fine-grained motion and camera control. The approach offers a flexible, interpretable interface for controllable video synthesis, with room for broader application and future refinement, such as removing visual markers after generation and leveraging natural scene signals.

Abstract

Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. Because distinct instructions can be assigned to different objects, this yields explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions. Extensive experiments on three state-of-the-art generators (Veo 3.1, Kling 2.5, and Wan 2.2) show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
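
To make the idea of visually embedded instructions concrete, the following is a minimal sketch of how per-object text and arrow marks might be described and rendered as a transparent overlay for the first frame. It is an illustration only, not the authors' pipeline: the names `TextInstruction`, `ArrowInstruction`, and `render_overlay`, the SVG output format, and all coordinates, colors, and sizes are assumptions made for this example. Compositing the overlay onto the image and calling a video generator are left out.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class TextInstruction:
    """Hypothetical type: a short command drawn next to the subject it targets."""
    text: str
    x: int  # left edge of the text, in pixels
    y: int  # text baseline, in pixels


@dataclass
class ArrowInstruction:
    """Hypothetical type: an arrow from a subject toward its intended motion."""
    x0: int
    y0: int
    x1: int
    y1: int


def render_overlay(width, height, instructions, color="red"):
    """Return an SVG string containing only the instruction marks.

    The SVG has no background, so it can be composited over the first
    frame with any image tool before the frame is given to a generator.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        # Arrowhead definition, reused by every arrow via marker-end.
        '<defs><marker id="head" viewBox="0 0 10 10" markerWidth="4" '
        'markerHeight="4" refX="9" refY="5" orient="auto">'
        f'<path d="M0,0 L10,5 L0,10 z" fill="{color}"/></marker></defs>',
    ]
    for ins in instructions:
        if isinstance(ins, TextInstruction):
            # Escape the text so characters like & and < stay valid XML.
            parts.append(
                f'<text x="{ins.x}" y="{ins.y}" fill="{color}" font-size="28" '
                f'font-family="sans-serif">{escape(ins.text)}</text>'
            )
        elif isinstance(ins, ArrowInstruction):
            parts.append(
                f'<line x1="{ins.x0}" y1="{ins.y0}" x2="{ins.x1}" y2="{ins.y1}" '
                f'stroke="{color}" stroke-width="4" marker-end="url(#head)"/>'
            )
        else:
            raise TypeError(f"unsupported instruction: {ins!r}")
    parts.append("</svg>")
    return "\n".join(parts)


# Two subjects receive different commands; a third mark gives a motion direction.
overlay = render_overlay(1280, 720, [
    TextInstruction("wave hand", x=180, y=120),
    TextInstruction("sit down", x=820, y=140),
    ArrowInstruction(640, 600, 900, 600),
])
```

Placing each mark next to its subject is what gives the instruction its spatial grounding; a single global text prompt has no comparable way to say which object a command refers to.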
