Point2Insert: Video Object Insertion via Sparse Point Guidance
Yu Zhou, Xiaoyan Yang, Bojia Zi, Lihan Zhang, Ruijie Sun, Weishi Zheng, Haibin Huang, Chi Zhang, Xuelong Li
TL;DR
Point2Insert tackles video object insertion with sparse positive and negative points, eliminating the need for dense masks while still enabling precise placement. It introduces a two-stage training pipeline that first learns a mask- and point-guided insertion model and then fine-tunes it on paired data synthesized by an object removal model, aided by mask-to-point distillation to bridge dense and sparse controls. A large 1.3M-image-scale video-editing dataset and the PointBench benchmark support comprehensive evaluation, showing state-of-the-art performance with a 1.3B-parameter model that rivals much larger baselines. The approach delivers accurate, photorealistic insertions across diverse objects and scenes with significantly reduced user effort, marking a practical advance for video editing workflows.
Abstract
This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing demand for accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we leverage a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with $10\times$ more parameters.
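To make the positive/negative point prompts concrete, the sketch below shows one plausible way to derive sparse points from a binary mask and rasterize them into a conditioning map. It is an illustration only, not the method described in the paper: the function names `sample_points` and `rasterize_points`, the uniform random sampling, and the +1 / -1 / 0 encoding are all assumptions introduced here.

```python
import random
from typing import List, Tuple

Point = Tuple[int, int]  # (row, col) pixel coordinate


def sample_points(
    mask: List[List[int]], n_pos: int, n_neg: int, seed: int = 0
) -> Tuple[List[Point], List[Point]]:
    """Sample sparse points from a binary mask (hypothetical helper).

    Positive points are drawn uniformly from pixels inside the mask,
    negative points from pixels outside it. Uniform sampling is an
    assumption; the source does not specify a sampling strategy.
    """
    rng = random.Random(seed)
    inside = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    outside = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if not v]
    # Never request more points than there are candidate pixels.
    pos = rng.sample(inside, min(n_pos, len(inside)))
    neg = rng.sample(outside, min(n_neg, len(outside)))
    return pos, neg


def rasterize_points(
    pos: List[Point], neg: List[Point], height: int, width: int
) -> List[List[int]]:
    """Encode points as a single-channel map (assumed encoding).

    +1 marks a positive point, -1 a negative point, 0 everything else.
    """
    cond = [[0] * width for _ in range(height)]
    for r, c in pos:
        cond[r][c] = 1
    for r, c in neg:
        cond[r][c] = -1
    return cond
```

In a mask-to-point setup of this kind, the same mask could feed a mask-guided teacher while the sampled points feed the point-guided student, so both see a consistent target region; whether Point2Insert pairs its inputs exactly this way is not stated in the abstract.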
