AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance
Zhao Wang, Hao Wen, Lingting Zhu, Chenming Shang, Yujiu Yang, Qi Dou
TL;DR
AnyCharV generates videos that place an arbitrary source character in a target scene, guided by pose information. It introduces a two-stage diffusion-based framework: Stage 1 performs self-supervised composition, using a fine segmentation mask and pose to align the reference character with the target video; Stage 2 applies self-boosting with coarse bounding-box guidance to better preserve character identity during inference. The method is validated on a newly constructed CharVG dataset, where it outperforms state-of-the-art baselines both qualitatively and quantitatively, and it is shown to be applicable to text-to-image/video workflows. The result is a practical, generalizable pipeline for controllable character video generation with strong identity retention and flexible background control, with applications in film, art, and AI-assisted content creation.
Abstract
Character video generation is a significant real-world application focused on producing high-quality videos featuring specific characters. Recent advances have introduced various control signals to animate static characters, successfully enhancing control over the generation process. However, these methods often lack flexibility, which limits their applicability and makes it difficult for users to synthesize a source character into a desired target scene. To address this issue, we propose a novel framework, AnyCharV, that flexibly generates character videos from arbitrary source characters and target scenes, guided by pose information. Our approach involves a two-stage training process. In the first stage, we develop a base model capable of integrating the source character with the target scene using pose guidance. The second stage further bootstraps controllable generation through a self-boosting mechanism: we use the video generated in the first stage and replace the fine mask with a coarse one, which leads to better preservation of character details. Extensive experimental results demonstrate the superiority of our method over previous state-of-the-art methods.
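
The second stage replaces the fine mask with a coarse one. As a rough illustration of what that substitution could look like, the sketch below turns a per-pixel segmentation mask into a filled bounding-box mask. It assumes the coarse mask is the tight axis-aligned bounding box of the foreground; the function name `fine_to_coarse_mask` and the use of NumPy are illustrative choices, not details taken from the paper.

```python
import numpy as np


def fine_to_coarse_mask(fine_mask: np.ndarray) -> np.ndarray:
    """Fill the tight axis-aligned bounding box of the foreground.

    Assumption (not from the paper): "coarse" means the smallest
    rectangle that contains every non-zero pixel of the fine mask.
    """
    fine = np.asarray(fine_mask).astype(bool)
    coarse = np.zeros(fine.shape, dtype=np.uint8)

    # Indices of the rows and columns that contain any foreground pixel.
    rows = np.flatnonzero(fine.any(axis=1))
    cols = np.flatnonzero(fine.any(axis=0))
    if rows.size == 0:
        # Empty fine mask: there is no character, so the box is empty too.
        return coarse

    # Fill the rectangle spanned by the first and last foreground row/column.
    coarse[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1] = 1
    return coarse


if __name__ == "__main__":
    # A small plus-shaped "character" on a 6x6 frame.
    fine = np.zeros((6, 6), dtype=np.uint8)
    fine[1, 2] = 1
    fine[2, 1:4] = 1
    fine[3, 2] = 1
    print(fine_to_coarse_mask(fine))
```

For a video, the same function would be applied to each frame independently. A box of this kind marks where the character is without encoding its silhouette, which is the sense in which it is coarser than the segmentation mask.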
