Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Jianzong Wu, Xiangtai Li, Chenyang Si, Shangchen Zhou, Jingkang Yang, Jiangning Zhang, Yining Li, Kai Chen, Yunhai Tong, Ziwei Liu, Chen Change Loy

TL;DR

This work introduces a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process and integrates Multimodal Large Language Models to understand and execute complex language-based inpainting requests.

Abstract

We introduce a new task -- language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We will make datasets, code, and models publicly available.

Paper Structure (22 sections, 9 equations, 16 figures, 4 tables)

This paper contains 22 sections, 9 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Language-driven video inpainting. It contains two sub-tasks based on the expression types. The referring video inpainting task takes simple referring expressions as input, while interactive video inpainting receives chat-style conversations. The conversation may encounter implicit requests, and the model needs to reason for a correct understanding.
  • Figure 2: Comparison with general image editing models. InstructPix2Pix and MagicBrush are general image editing methods based on diffusion models. They produce inferior results when instructed to remove objects in videos.
  • Figure 3: The ROVI dataset statistics. Best viewed in color.
  • Figure 4: ROVI dataset annotation pipeline. The building process of the ROVI dataset involves two distinct phases: inpainting annotation and interactive annotation. In the inpainting annotation phase, the primary objective is to incorporate inpainting results into existing referring video segmentation datasets, which initially contain object masks and expressions. During the interactive annotation pipeline, we follow a multi-step approach incorporating LLMs and MLLMs. Best viewed in color.
  • Figure 5: The training process of LGVI and LGVI-I. We inflate the U-Net with a temporal dimension to allow video input. To ensure temporal consistency in the generated videos, we introduce a temporal attention module between the cross-attention and FFN layers. Additionally, we propose a mask decoder module for explicit guidance in inpainting tasks. We augment LGVI with MLLM joint training for interactive video inpainting, resulting in LGVI-I as the baseline. The output of the MLLM includes a set of prompt tokens, which are fed into the cross-attention layers of the U-Net.
  • ...and 11 more figures