CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

Bojia Zi; Shihao Zhao; Xianbiao Qi; Jianan Wang; Yukai Shi; Qianyu Chen; Bin Liang; Kam-Fai Wong; Lei Zhang

TL;DR

CoCoCo tackles inconsistencies and weak text-video alignment in text-guided video inpainting by introducing a motion capture module with damped global attention and textual cross-attention, an instance-aware region selection strategy grounded by GroundingDINO, and a task-vector based adaptation to plug personalized T2I models into a latent-diffusion inpainting framework. Built on a UNet diffusion backbone with a frozen spatial block and a trainable motion module, it operates on latent representations encoded by a VAE and supports integration of user-specific models for personalized content in the masked regions. Evaluations on WebVid-10M demonstrate improved motion consistency, textual controllability, and compatibility with personalized models, outperforming several baselines in background preservation and temporal coherence and achieving competitive text alignment. The work offers practical, modular components that enhance the reliability and customization of text-guided video inpainting for real-world use.

Abstract

Recent advancements in video generation have been remarkable, yet many existing methods struggle with inconsistency and poor text-video alignment. Moreover, the field lacks effective techniques for text-guided video inpainting, in stark contrast to the well-explored domain of text-guided image inpainting. To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability and compatibility. Specifically, we introduce a simple but efficient motion capture module to preserve motion consistency, design an instance-aware region selection strategy in place of random region selection to obtain better textual controllability, and use a novel strategy to inject personalized models into our CoCoCo model for better model compatibility. Extensive experiments show that our model generates high-quality video clips while improving motion consistency, textual controllability and model compatibility. More details are available at [cococozibojia.github.io](cococozibojia.github.io).

Paper Structure (21 sections, 6 equations, 13 figures, 3 tables)

Introduction
Related Work
Methodology
The Overall Framework of CoCoCo
Motion Capture Module
Instance-aware Region Selection for Video Inpainting
Adapting Image Generation Model for Video Inpainting
Training Objectives
Experiments
Implementation Details
Experimental Results
Quantitative Comparison.
Qualitative Results.
Ablation Study.
...and 6 more sections

Figures (13)

Figure 1: The inpainting results of our CoCoCo method. The first and second rows are the results of our model with CounterfeitV30 T2I personalized model plugged in, and the last two rows are the results only with our model. Best viewed with Acrobat Reader. Click the images to play the animation clips.
Figure 2: The overall framework of CoCoCo. CoCoCo takes three inputs: the masked video, the mask, and the noised video. As shown at the top of the figure, our model can adapt text-to-image (T2I) personalized models without model-specific tuning to perform text-guided video inpainting. The personalized models can be downloaded from open-source platforms such as CivitAI and Hugging Face. As shown at the bottom of the figure, our model uses a newly introduced motion capture module that consists of three types of attention blocks.
Figure 3: The comparison between temporal attention and damped global attention. The dotted line indicates the positions that can be attended.
Figure 4: The instance-aware region selection pipeline and data sampling strategy. Specifically, we use the tokenspan to fix the candidate phrases and use a random-shaped mask to cover the bounding box. During training, we sample three types of input data with different probabilities.
Figure 5: The pipeline of our transformation strategy. As shown in the figure, we compute the task vectors of inpainting $\tau_{ip}$ and personalized generation $\tau_{p}$, and subsequently mix the two vectors with ratios $\alpha$ and $\beta$ to obtain the personalized inpainting model $\theta_{new}$.
...and 8 more figures
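
The task-vector mixing described in the Figure 5 caption can be illustrated with a small sketch. It assumes the common task-arithmetic form, in which each task vector is the difference between a fine-tuned model and a shared base model, and the mixed model is $\theta_{new} = \theta_{base} + \alpha\,\tau_{ip} + \beta\,\tau_{p}$. The caption names only $\tau_{ip}$, $\tau_{p}$, $\alpha$, $\beta$ and $\theta_{new}$; the base weights `theta_base`, the helper names, and the toy values below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Element-wise difference (finetuned - base) for every parameter in base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def mix_task_vectors(
    base: Weights, tau_ip: Weights, tau_p: Weights, alpha: float, beta: float
) -> Weights:
    """Assumed mixing rule: theta_new = base + alpha * tau_ip + beta * tau_p."""
    return {
        name: [
            b + alpha * ip + beta * p
            for b, ip, p in zip(base[name], tau_ip[name], tau_p[name])
        ]
        for name in base
    }


# Hypothetical toy weights: one parameter "w" with two entries.
theta_base = {"w": [1.0, 2.0]}
theta_inpaint = {"w": [1.5, 2.0]}   # base fine-tuned for inpainting
theta_personal = {"w": [1.0, 3.0]}  # base fine-tuned as a personalized T2I model

tau_ip = task_vector(theta_inpaint, theta_base)
tau_p = task_vector(theta_personal, theta_base)
theta_new = mix_task_vectors(theta_base, tau_ip, tau_p, alpha=1.0, beta=0.5)
print(theta_new)
```

In this sketch, $\alpha$ scales how strongly the inpainting behaviour is applied and $\beta$ scales the personalized style. In practice the same arithmetic would run over tensors in the models' state dicts rather than Python lists.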
