AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance

Zhao Wang, Hao Wen, Lingting Zhu, Chenming Shang, Yujiu Yang, Qi Dou

TL;DR

AnyCharV tackles the problem of flexibly generating videos that place an arbitrary source character in a target scene, guided by pose information. It introduces a two-stage diffusion-based framework: Stage 1 performs self-supervised composition, using a fine segmentation mask and pose guidance to align the reference character with the target video, and Stage 2 applies self-boosting training with coarse bounding-box guidance so that character identity is better preserved at inference. The method is validated on a newly constructed CharVG dataset, where it outperforms state-of-the-art baselines both qualitatively and quantitatively, and it is also shown to work with text-to-image/video workflows. The result is a practical, generalizable pipeline for controllable character video generation with strong identity retention and flexible background control, with applications in film, art, and AI-assisted content creation.

Abstract

Character video generation is a significant real-world application focused on producing high-quality videos featuring specific characters. Recent advancements have introduced various control signals to animate static characters, successfully enhancing control over the generation process. However, these methods often lack flexibility, limiting their applicability and making it challenging for users to synthesize a source character into a desired target scene. To address this issue, we propose a novel framework, AnyCharV, that flexibly generates character videos using arbitrary source characters and target scenes, guided by pose information. Our approach involves a two-stage training process. In the first stage, we develop a base model capable of integrating the source character with the target scene using pose guidance. The second stage further bootstraps controllable generation through a self-boosting mechanism: we take the video generated in the first stage and replace the fine mask with a coarse one, so that training yields better preservation of character details. Extensive experimental results demonstrate the superiority of our method compared with previous state-of-the-art methods.

Paper Structure

This paper contains 45 sections, 3 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Our method takes a reference character image (left) and a target driving video (top) that supplies complex motion, the background scene, and interaction information. It naturally synthesizes the arbitrary reference character following the motion in the target video, while preserving the real-world scene and complex interactions.
  • Figure 2: Overview of the proposed AnyCharV. We design a two-stage pipeline with fine-to-coarse guidance for controllable character video generation. In the first stage, we train a base model in a self-supervised manner to compose a reference character with the target scene $\boldsymbol{s}_d$, guided by a fine segmentation mask $\boldsymbol{m}_d$ and a 2D pose sequence $\boldsymbol{p}_d$. In the second stage, we propose a self-boosting training strategy in which the reference and target characters interact under coarse bounding box mask guidance. The CLIP encoder and VAE are always frozen. We train the denoising UNet, ReferenceNet, and pose guider during the first stage, and finetune only the denoising UNet in the second stage.
  • Figure 3: Qualitative results of our method compared with previous SOTA methods. The reference character and target video are shown at the top. Each subsequent row corresponds to one method.
  • Figure 4: Visualization of the effect of different components. The reference character and target video are shown at the top. Each subsequent row shows the video generated by one variant.
  • Figure 5: Visualization of the effect of different mask types. The reference character and target video are shown at the top. Each subsequent row shows the video generated by one variant. 'Box & Seg.' indicates that the bounding box mask is used in the first stage and the segmentation mask is used in the second stage.
  • ...and 5 more figures
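
The fine-to-coarse guidance described in the Figure 2 caption contrasts a fine segmentation mask with a coarse bounding box mask. As a rough illustration only, the sketch below derives a box mask from a binary segmentation mask by filling the tight bounding box of the foreground pixels. The function name `bbox_mask_from_segmentation`, the nested-list mask representation, and the choice of a tight box with no padding are assumptions made for this example; the paper's actual mask construction may differ.

```python
from typing import List

Mask = List[List[int]]  # binary mask as rows of 0/1 values (assumed representation)


def bbox_mask_from_segmentation(seg: Mask) -> Mask:
    """Return a coarse mask that fills the tight bounding box of seg's nonzero pixels.

    Hypothetical helper for illustration; not taken from the AnyCharV code.
    """
    height = len(seg)
    width = len(seg[0]) if height else 0

    # Rows and columns that contain at least one foreground pixel.
    rows = [r for r in range(height) if any(seg[r])]
    cols = [c for c in range(width) if any(seg[r][c] for r in range(height))]

    # An empty segmentation mask yields an empty box mask.
    if not rows:
        return [[0] * width for _ in range(height)]

    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]

    # Set every pixel inside the box to 1, discarding the fine silhouette.
    return [
        [1 if top <= r <= bottom and left <= c <= right else 0 for c in range(width)]
        for r in range(height)
    ]


# A small diamond-shaped "character" silhouette as the fine mask.
fine_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
coarse_mask = bbox_mask_from_segmentation(fine_mask)
for row in coarse_mask:
    print(row)
```

The box mask keeps the character's location and extent but drops its exact outline, which is the sense in which the second-stage guidance is coarser than the first-stage segmentation mask.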