PISCO: Precise Video Instance Insertion with Sparse Control

Xiangbo Gao, Renjie Li, Xinghao Chen, Yuheng Wu, Suofei Feng, Qing Yin, Zhengzhong Tu

TL;DR

PISCO tackles the problem of inserting a specific instance into an existing video with precise spatial placement and temporally coherent dynamics under sparse user supervision. It achieves this with a diffusion-based framework that fuses multi-channel instance cues via a context adapter, and introduces Variable-Information Guidance to handle varying supervision densities along with Distribution-Preserving Temporal Masking to stabilize temporal generation. Depth-aware conditioning, amodal completion, and relighting augmentations further ensure physically plausible interactions and illumination. The authors validate their approach on PISCO-Bench, showing consistent gains over strong inpainting and editing baselines and demonstrating monotonic improvements as more sparse control signals are provided. Collectively, PISCO enables professional-grade, low-effort instance-centric video editing and generalizes to broader editing and simulation tasks.

Abstract

The landscape of AI video generation is undergoing a pivotal shift: moving beyond general generation, which relies on exhaustive prompt engineering and "cherry-picking", towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task imposes several requirements: precise spatio-temporal placement, physically consistent scene interaction, and faithful preservation of the original dynamics, all achieved with minimal user effort. In this paper, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO allows users to specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps, and automatically propagates object appearance, motion, and interaction. To address the severe distribution shift induced by sparse conditioning in pretrained video diffusion models, we introduce Variable-Information Guidance for robust conditioning and Distribution-Preserving Temporal Masking to stabilize temporal generation, together with geometry-aware conditioning for realistic scene adaptation. We further construct PISCO-Bench, a benchmark with verified instance annotations and paired clean background videos, and evaluate performance using both reference-based and reference-free perceptual metrics. Experiments demonstrate that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control, and exhibits clear, monotonic performance improvements as additional control signals are provided. Project page: xiangbogaobarry.github.io/PISCO.

Paper Structure (38 sections, 3 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: PISCO enables precise video instance insertion with arbitrary sparse keyframe control. Given a clean input video and a few user-provided instance cutouts at selected timestamps, PISCO inserts the instance with coherent temporal propagation and physical effects while preserving the original background dynamics.
  • Figure 2: Overview of the PISCO pipeline. We train a conditional video diffusion model with sparse keyframe control via Variable-Information Guidance (VIG), and stabilize sparse conditioning under pretrained temporal VAEs using Distribution-Preserving Temporal Masking (DPTM): pixel-space nearest-frame interpolation followed by token-level masking, with spatial mask and availability signals injected alongside RGB/depth conditions.
  • Figure 3: Visualization of depth-aware insertion. Conditioning on depth improves depth ordering and occlusion handling, reducing foreground/background blending artifacts compared to a depth-agnostic variant.
  • Figure 4: Effectiveness of DPTM under sparse guidance. We compare results given segmented instance inputs only on odd frames. Naïve masking leads to distribution shifts and temporal artifacts, whereas DPTM preserves encoder input statistics and significantly improves temporal stability.
  • Figure 5: Visualization of amodal instance completion and relighting augmentations. We complete occluded instance cutouts to form pseudo-amodal inputs and apply moderate relighting to improve illumination compatibility with the target scene while preserving keyframe appearance controllability.
  • ...and 3 more figures
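
The Figure 2 caption describes DPTM as pixel-space nearest-frame interpolation followed by token-level masking, with an availability signal injected alongside the conditions. The sketch below is one possible reading of that description, not the authors' implementation. The function names `nearest_frame_fill` and `token_mask`, the tie-breaking rule, the `stride` parameter (standing in for a temporal VAE compression factor), and the rule that a token counts as available if any frame in its window is a keyframe are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def nearest_frame_fill(
    frames: Sequence[Optional[Frame]],
) -> Tuple[List[Frame], List[int]]:
    """Fill missing frames with the temporally nearest available keyframe.

    Returns the filled sequence and a per-frame availability signal
    (1 = user-provided keyframe, 0 = filled in by interpolation).
    Ties go to the earlier keyframe (an arbitrary choice for this sketch).
    """
    keyframe_idx = [i for i, f in enumerate(frames) if f is not None]
    if not keyframe_idx:
        raise ValueError("at least one keyframe is required")

    filled: List[Frame] = []
    available: List[int] = []
    for t, frame in enumerate(frames):
        if frame is not None:
            filled.append(frame)
            available.append(1)
        else:
            # Closest keyframe in time; the second key element breaks ties.
            nearest = min(keyframe_idx, key=lambda k: (abs(k - t), k))
            filled.append(frames[nearest])
            available.append(0)
    return filled, available


def token_mask(available: Sequence[int], stride: int) -> List[int]:
    """Reduce per-frame availability to one flag per temporal token.

    Assumption: a token is marked available if any frame in its window
    of `stride` frames is a real keyframe.
    """
    return [
        int(any(available[i:i + stride]))
        for i in range(0, len(available), stride)
    ]


# Keyframes on frames 0 and 5 only; strings stand in for image arrays.
frames = ["A", None, None, None, None, "B"] + [None] * 6
filled, available = nearest_frame_fill(frames)
tokens = token_mask(available, stride=4)
```

In this toy run the encoder would see a fully populated frame sequence (no blank frames), while `tokens` records which temporal tokens carry real user input. Under this reading, that is the intuition behind preserving the encoder's input statistics, as described in the Figure 4 caption.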