InstanceV: Instance-Level Video Generation

Yuheng Chen, Teng Hu, Jiangning Zhang, Zhucun Xue, Ran Yi, Lizhuang Ma

TL;DR

InstanceV adds instance-level controllability to video diffusion by integrating instance grounding into the generation process through three components: Instance-aware Masked Cross-Attention (IMCA), Shared Timestep-Adaptive Prompt Enhancement (STAPE), and Spatially-Aware Unconditional Guidance (SAUG). A data-preparation pipeline with MLLM-based instance partitioning and downscale/patchifying supports robust instance grounding with limited compute. The proposed InstanceBench benchmark combines general video quality metrics with instance-aware metrics, and experiments show that InstanceV achieves strong instance fidelity, improved layout accuracy, and better text–video alignment while maintaining overall video quality. The work contributes an efficient, training-friendly architecture for fine-grained, location-specific control in text-to-video diffusion, together with a benchmark for assessing instance-level performance.

Abstract

Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions and lack general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV makes full use of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for a more comprehensive evaluation of instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.
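The abstract does not spell out how the masked cross-attention is computed, so the following is only a minimal sketch of the general idea of masking cross-attention by instance region, not the authors' implementation. It assumes that each visual token may attend only to the prompt tokens of instances whose region covers it, and that tokens outside every region receive a zero update. The function name `masked_cross_attention` and the arguments `key_instance` and `token_mask` are hypothetical, and learned projections and multiple heads are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(queries, keys, values, key_instance, token_mask):
    """Illustrative instance-masked cross-attention (assumed form).

    queries:      N visual tokens, each a list of d floats
    keys, values: M instance-prompt tokens, each a list of d floats
    key_instance: length-M list; key_instance[j] is the instance id of prompt token j
    token_mask:   N x I booleans; token_mask[n][i] is True when visual
                  token n lies inside the region of instance i
    Returns N output vectors; tokens outside every region get zeros
    (an assumption made for this sketch).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for n, q in enumerate(queries):
        # Keep only prompt tokens whose instance region covers this visual token.
        allowed = [j for j, inst in enumerate(key_instance) if token_mask[n][inst]]
        if not allowed:
            outputs.append([0.0] * d)
            continue
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in allowed]
        weights = softmax(scores)
        out = [sum(w * values[j][c] for w, j in zip(weights, allowed)) for c in range(d)]
        outputs.append(out)
    return outputs


# Three visual tokens, two instances with one prompt token each.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 20.0]]
key_instance = [0, 1]
token_mask = [[True, False], [False, True], [False, False]]
out = masked_cross_attention(queries, keys, values, key_instance, token_mask)
```

In this toy setup the first visual token reads only from instance 0, the second only from instance 1, and the third, which lies in no instance region, is left unchanged by the instance branch. A real model would apply this per frame with batched tensor operations rather than Python loops.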

Paper Structure

This paper contains 36 sections, 7 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The instance-level grounding information not only helps the video diffusion model better understand fine-grained control cues in the global video caption, such as instance layout and attributes, but also enables indirect realization of instance-level trajectory control and global camera motion control.
  • Figure 2: Overview of the proposed InstanceV framework. (a) Only a small subset of visual tokens is shown due to space limitations; this visualization can be interpreted as a case where $H=1$. (b) Solid colors denote independently encoded instance prompts, while the tokens with color encode more global semantic information. (c) The grounding information consists of $F$ groups of attention masks and instance prompts, where the instances differ across frames due to temporal variation. Note that for simplicity, the attention masks are drawn as $F \times H \times W$, with colors distinguishing different instances.
  • Figure 3: Quality comparison between our proposed InstanceV and state-of-the-art video diffusion models. The first, middle, and last frames are shown for illustration.
  • Figure 4: Ablation studies on the proposed modules.
  • Figure 5: Ablation on CFG scale. We compare three settings (CFG = 5, 6, 7) by visualizing four representative videos, each shown with its first frame, a middle frame, and the final frame.
  • ...and 4 more figures