Point2Insert: Video Object Insertion via Sparse Point Guidance

Yu Zhou, Xiaoyan Yang, Bojia Zi, Lihan Zhang, Ruijie Sun, Weishi Zheng, Haibin Huang, Chi Zhang, Xuelong Li

TL;DR

Point2Insert tackles video object insertion with sparse positive and negative points, eliminating the need for dense masks while still enabling precise placement. It introduces a two-stage training pipeline that first learns a mask- and point-guided insertion model and then fine-tunes on removal-derived data, aided by mask-to-point distillation to bridge dense and sparse controls. A large 1.3M-image-scale video-editing dataset and the PointBench benchmark support comprehensive evaluation, showing state-of-the-art performance with a 1.3B-parameter model that rivals much larger baselines. The approach delivers accurate, photorealistic insertions across diverse objects and scenes with significantly reduced user effort, marking a practical advance for video editing workflows.

Abstract

This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing demand for accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we leverage a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with $10\times$ more parameters.
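To make the positive/negative point control concrete, the following is a minimal sketch of how sparse points might be sampled from an object mask and drawn into a point map. It is an illustration under stated assumptions, not the paper's implementation: the function names `sample_points` and `rasterize_point_map`, the disc radius, and the +1 / -1 / 0 encoding are all hypothetical choices.

```python
import numpy as np

def sample_points(mask, n_pos=3, n_neg=3, seed=0):
    """Sample (row, col) points inside the mask (positive) and outside it (negative)."""
    rng = np.random.default_rng(seed)
    inside = np.argwhere(mask > 0)    # candidate positive locations
    outside = np.argwhere(mask == 0)  # candidate negative locations
    pos = inside[rng.choice(len(inside), size=min(n_pos, len(inside)), replace=False)]
    neg = outside[rng.choice(len(outside), size=min(n_neg, len(outside)), replace=False)]
    return pos, neg

def rasterize_point_map(shape, pos, neg, radius=2):
    """Draw small discs: +1 at positive points, -1 at negative points, 0 elsewhere.

    Assumed encoding; negative discs are drawn last, so they win on overlap.
    """
    point_map = np.zeros(shape, dtype=np.float32)
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    for points, value in ((pos, 1.0), (neg, -1.0)):
        for r, c in points:
            disc = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
            point_map[disc] = value
    return point_map

# Toy example: a rectangular object mask on a 32x32 frame.
mask = np.zeros((32, 32), dtype=np.uint8)
mask[8:24, 10:20] = 1
pos, neg = sample_points(mask)
point_map = rasterize_point_map(mask.shape, pos, neg)
```

In this sketch a user-facing tool would skip `sample_points` and pass clicked coordinates directly to `rasterize_point_map`; sampling from a mask is only relevant when building training data from segmented videos.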

Paper Structure (33 sections, 3 equations, 8 figures, 7 tables)

This paper contains 33 sections, 3 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: We present Point2Insert, a novel sparse point-guided framework designed for precise video object insertion. By leveraging a small number of sparse positive and negative points, our approach bypasses the need for tedious, frame-by-frame masking operations. Point2Insert handles diverse assets, capable of adding objects from people and animals to household items and background elements. Whether dealing with static or moving targets, or videos with moving cameras, Point2Insert consistently produces accurate results with low-effort control points.
  • Figure 2: (a) Training Dataset Construction: Our pipeline detects objects within videos, extracts segmentation masks, removes the objects, and extracts video captions. Finally, we sample positive and negative points for training. (b) Comparison of mask- and point-only results: The mask-based setting generates more natural-looking objects due to explicit boundary cues. Point maps without clear boundaries often result in irregular shapes, geometric distortions, and blurred appearances.
  • Figure 3: Model overview. The left panel illustrates that Point2Insert maps the source, point map, and target videos into a latent space via a VAE. These latents are concatenated channel-wise and processed by a DiT-based denoiser. The right panel details our two-stage training strategy: Stage 1: Insertion Model Pre-training. We use traditional lightweight inpainting to create source videos with objects removed. The DiT is then trained using a flow matching objective with maps generated via hybrid mask and point sampling. Stage 2: Stronger Model Distillation. A frozen teacher model from Stage 1 supervises a student model. While the teacher uses masks, the student employs point maps and source videos from a removal model. For further refinement, weight maps downsampled from the point-based map are applied to a weighted flow matching loss.
  • Figure 4: Qualitative comparison of video object insertion. Source videos (top) are shown with a dark overlay and contain sparse editable and uneditable points. To ensure a fair comparison, we adapt the input to handle these sparse signals: for instruction-based models, Qwen2.5-VL translates source videos with markers into textual position prompts; for mask-based models, the convex hull of the positive points across all keyframes is used as the input mask. In contrast, Point2Insert directly consumes sparse point maps. Bounding boxes highlight specific artifacts: incorrect editing or wrong placement, and insertion failure or unintended removal. Our method outperforms other methods in both photorealism and localization accuracy.
  • Figure 5: More video object insertion results.
  • ...and 3 more figures
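
The Figure 3 caption mentions a weighted flow matching loss whose weight maps are downsampled from the point-based map. The sketch below shows one way such a loss could be computed. It is a hedged illustration only: the straight-path target `x1 - x0`, the average-pool downsampling, the `base` and `boost` values, and the helper names are assumptions, not details confirmed by the source.

```python
import numpy as np

def downsample_weight_map(point_map, factor, base=1.0, boost=2.0):
    """Pool |point_map| down to latent resolution and convert it to loss weights.

    Latent cells that contain any point get weight base + boost; others get base.
    (Assumed scheme; the source does not specify the weighting.)
    """
    h, w = point_map.shape
    assert h % factor == 0 and w % factor == 0
    pooled = np.abs(point_map).reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return base + boost * (pooled > 0)

def weighted_flow_matching_loss(v_pred, x0, x1, weights):
    """Weighted MSE between a predicted velocity and the straight-path target x1 - x0."""
    target = x1 - x0                      # velocity of the path x_t = (1 - t) * x0 + t * x1
    per_elem = (v_pred - target) ** 2     # shape (C, H, W)
    weighted = per_elem * weights[None]   # broadcast the (H, W) weights over channels
    return weighted.sum() / (weights.sum() * v_pred.shape[0])

# Toy example: 16x16 point map, 4x spatial downsampling, 8 latent channels.
rng = np.random.default_rng(0)
point_map = np.zeros((16, 16), dtype=np.float32)
point_map[4, 4] = 1.0     # one positive point
point_map[12, 10] = -1.0  # one negative point
weights = downsample_weight_map(point_map, factor=4)

x0 = rng.standard_normal((8, 4, 4))  # noise latent
x1 = rng.standard_normal((8, 4, 4))  # target-video latent
v_perfect = x1 - x0                  # stand-in for a denoiser's output
loss_perfect = weighted_flow_matching_loss(v_perfect, x0, x1, weights)
loss_offset = weighted_flow_matching_loss(v_perfect + 1.0, x0, x1, weights)
```

Normalizing by the sum of the weights keeps the loss on the same scale as an unweighted MSE, so changing `boost` shifts emphasis toward point regions without rescaling the overall objective.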