In-Video Instructions: Visual Signals as Generative Control

Gongfan Fang, Xinyin Ma, Xinchao Wang

TL;DR

This work tackles controllable image-to-video generation by enabling zero-shot control through visual signals embedded directly in the initial frame. It introduces In-Video Instruction, which uses two simple primitives—overlaid text and arrows—to create explicit, spatially grounded correspondences between subjects and actions, without retraining or architectural changes. The method is evaluated across multiple state-of-the-art generators, demonstrating reliable interpretation of embedded instructions and superior localization in multi-object scenes, along with fine-grained motion and camera controls. The approach offers a flexible, interpretable interface for controllable video synthesis with potential for broader application and future refinement, such as removing visual markers post-generation and leveraging natural scene signals.

Abstract

Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. Assigning distinct instructions to different objects in this way yields explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions. Extensive experiments on three state-of-the-art generators (Veo 3.1, Kling 2.5, and Wan 2.2) show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.

Paper Structure

This paper contains 27 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: In-Video Instruction controls generation by placing the instruction directly on the first frame, providing explicit spatial grounding for the instruction’s scope. This enables assigning independent, less ambiguous, and even multi-step sequential commands to different targets. During generation, we fix the textual prompt to “follow the instructions step by step” and rely solely on in-frame visual signals for control.
  • Figure 2: Spatial Localization Ability of In-Video Instructions. We use In-Video Instructions to localize a target object among multiple entities and execute the corresponding action. The prompt-based baseline instead locates the target through ChatGPT-generated textual descriptions such as "the N-th object from the left". As shown, In-Video Instructions enable precise and unambiguous localization, whereas text-only prompts exhibit noticeable limitations in resolving object positions.
  • Figure 3: Controlling object motions or trajectories with In-Video Instructions.
  • Figure 4: Controlling camera motion with In-Video Instructions. We visualize the initial frame and the final output for seven camera-motion types: static, pan left, pan right, tilt down, tilt up, zoom in, and zoom out.
  • Figure 5: In-Video Instructions with Multiple Objects and Commands, enabling both sequential instructions that involve a series of actions and parallel instructions that manipulate different objects independently.
  • ...and 1 more figure
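
The figure captions above describe placing overlaid text and arrows on the first frame while the textual prompt stays fixed. The source does not say which tooling the authors used to draw these markers, so the following is only a minimal sketch of one way to prepare such an overlay, under stated assumptions: the names `TextLabel`, `Arrow`, `arrow_head`, and `overlay_svg` are hypothetical, as are the colour, font size, and arrowhead dimensions. It builds an SVG layer that could then be composited onto the first frame with any image tool; calling a video generator is out of scope here.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple
from xml.sax.saxutils import escape

Point = Tuple[float, float]

# Fixed textual prompt quoted in the Figure 1 caption; all control comes from the overlay.
FIXED_PROMPT = "follow the instructions step by step"


@dataclass
class TextLabel:
    """Hypothetical container: an instruction written next to its target."""
    text: str
    x: float
    y: float


@dataclass
class Arrow:
    """Hypothetical container: a straight arrow from start to end (pixel coords)."""
    start: Point
    end: Point


def arrow_head(arrow: Arrow, length: float = 20.0,
               half_angle_deg: float = 25.0) -> List[Point]:
    """Return the two wing points of an arrowhead whose tip is arrow.end."""
    (x0, y0), (x1, y1) = arrow.start, arrow.end
    theta = math.atan2(y1 - y0, x1 - x0)      # direction of the shaft
    half = math.radians(half_angle_deg)
    wings = []
    for sign in (1, -1):
        a = theta + math.pi + sign * half      # point back from the tip
        wings.append((x1 + length * math.cos(a), y1 + length * math.sin(a)))
    return wings


def overlay_svg(width: int, height: int, labels: List[TextLabel],
                arrows: List[Arrow], color: str = "red") -> str:
    """Build an SVG layer holding the arrows and text instructions."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{width}" height="{height}">']
    for arrow in arrows:
        (x0, y0), (x1, y1) = arrow.start, arrow.end
        parts.append(f'<line x1="{x0}" y1="{y0}" x2="{x1}" y2="{y1}" '
                     f'stroke="{color}" stroke-width="4"/>')
        (ax, ay), (bx, by) = arrow_head(arrow)
        parts.append(f'<polygon points="{x1},{y1} {ax:.1f},{ay:.1f} '
                     f'{bx:.1f},{by:.1f}" fill="{color}"/>')
    for label in labels:
        parts.append(f'<text x="{label.x}" y="{label.y}" fill="{color}" '
                     f'font-size="24">{escape(label.text)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


# Example: one instruction per target, each placed near the object it refers to.
example_overlay = overlay_svg(
    1280, 720,
    labels=[TextLabel("wave hand", 200, 120), TextLabel("walk", 640, 560)],
    arrows=[Arrow((640, 600), (900, 600))],
)
```

Keeping each label and arrow as a separate record mirrors the idea in the captions that different targets receive independent instructions; the spatial grounding comes purely from where each marker is placed in the frame.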