CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
Bojia Zi, Shihao Zhao, Xianbiao Qi, Jianan Wang, Yukai Shi, Qianyu Chen, Bin Liang, Kam-Fai Wong, Lei Zhang
TL;DR
CoCoCo tackles temporal inconsistency and weak text-video alignment in text-guided video inpainting with three components: a motion capture module that combines damped global attention with textual cross-attention, an instance-aware region selection strategy grounded by GroundingDINO, and a task-vector-based adaptation that plugs personalized T2I models into a latent-diffusion inpainting framework. The model is built on a UNet diffusion backbone with a frozen spatial block and a trainable motion module, operates on latent representations encoded by a VAE, and supports integrating user-specific models to generate personalized content in the masked regions. Evaluations on WebVid-10M show improved motion consistency, textual controllability, and compatibility with personalized models: CoCoCo outperforms several baselines in background preservation and temporal coherence and achieves competitive text alignment. The components are modular, which makes text-guided video inpainting more reliable and easier to customize in practice.
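The task-vector idea mentioned above can be illustrated with plain weight arithmetic. The sketch below is not the authors' implementation. It assumes that a personalized T2I model and the inpainting model share a common base checkpoint, that the personalization is captured by the per-parameter difference from that base, and that this difference is added to the matching parameters of the inpainting model. The function names, the `scale` parameter, and the toy parameter names (`conv.weight`, `motion.weight`) are hypothetical, and weights are represented as lists of floats so the example runs without a deep learning library.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(tuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference (tuned - base) for parameters present in both."""
    vector: Weights = {}
    for name, tuned_values in tuned.items():
        if name not in base:
            continue  # parameters without a base counterpart carry no task vector
        if len(tuned_values) != len(base[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        vector[name] = [t - b for t, b in zip(tuned_values, base[name])]
    return vector


def apply_task_vector(target: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Return a copy of target with scale * vector added to the shared parameters."""
    merged = {name: list(values) for name, values in target.items()}
    for name, delta in vector.items():
        if name in merged:
            merged[name] = [w + scale * d for w, d in zip(merged[name], delta)]
    return merged


# Hypothetical checkpoints: the inpainting model has an extra motion parameter
# that the image-only models lack, so it is left untouched by the merge.
base = {"conv.weight": [1.0, 2.0]}
personalized = {"conv.weight": [1.5, 1.0]}
inpaint = {"conv.weight": [1.0, 3.0], "motion.weight": [0.25]}

vector = task_vector(personalized, base)
merged = apply_task_vector(inpaint, vector)
```

In this toy setup only the parameters shared with the image model are shifted, which mirrors the description of a frozen spatial block paired with a separate motion module. How the actual method selects parameters and scales the vector is not specified here.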
Abstract
Recent advancements in video generation have been remarkable, yet many existing methods still suffer from poor consistency and weak text-video alignment. Moreover, the field lacks effective techniques for text-guided video inpainting, in stark contrast to the well-explored domain of text-guided image inpainting. To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability and compatibility. Specifically, we introduce a simple but efficient motion capture module to preserve motion consistency, replace random region selection with an instance-aware region selection to obtain better textual controllability, and use a novel strategy to inject personalized models into our CoCoCo model for better model compatibility. Extensive experiments show that our model generates high-quality video clips while achieving better motion consistency, textual controllability and model compatibility. More details are shown in [cococozibojia.github.io](cococozibojia.github.io).
