Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance

Beiyuan Zhang, Yue Ma, Chunlei Fu, Xinyang Song, Zhenan Sun, Ziqiang Li

TL;DR

The paper tackles multi-character text-to-video generation guided by pose sequences, addressing the limitation of prior work that mostly handles single-character generation. It introduces Follow-Your-MultiPose (FYM), a tuning-free framework that uses pose-derived masks to define spatial regions, per-character prompts derived via LLMs, spatial-aligned cross-attention to fuse prompts with region guidance, and a multi-branch ControlNet to independently condition each character. The approach is demonstrated on Stable Diffusion 1.5 with various personalized checkpoints, showing improved text-video alignment, pose-video alignment, and temporal coherence, and is shown to generalize across different T2I models. The methodology enables precise, per-character control and style versatility for practical applications in video generation and editing.

Abstract

Text-editable and pose-controllable character video generation is a challenging but prevailing topic with practical applications. However, existing approaches mainly focus on single-object video generation with pose guidance, ignoring the realistic situation in which multiple characters appear concurrently in a scene. To tackle this, we propose a novel tuning-free multi-character video generation framework based on separated text and pose guidance. Specifically, we first extract character masks from the pose sequence to identify the spatial position of each generated character, and then obtain a separate prompt for each character with LLMs for precise text guidance. Moreover, spatial-aligned cross-attention and a multi-branch control module are proposed to generate fine-grained, controllable multi-character video. Visualizations of the generated videos demonstrate the precise controllability of our method for multi-character generation. We also verify the generality of our method by applying it to various personalized T2I models. Moreover, quantitative results show that our approach achieves superior performance compared with previous works.
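
The abstract describes two steps that a toy example can make concrete: deriving a spatial mask per character from its pose, and using those masks to decide which character's guidance applies at each position. The sketch below is a rough illustration only, not the paper's implementation. The function names `pose_to_mask` and `fuse_by_mask`, the padded bounding-box mask, and the rule that later characters overwrite earlier ones are all assumptions made here; the paper's mask enhancement flow and spatial-aligned cross-attention are not specified in this summary. String labels stand in for per-prompt cross-attention outputs.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]
Grid = List[List[str]]


def pose_to_mask(keypoints: Sequence[Tuple[int, int]],
                 height: int, width: int, pad: int = 1) -> Mask:
    """Binary mask covering the padded bounding box of one character's
    (x, y) keypoints. Hypothetical stand-in for a pose-derived mask."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x0, x1 = max(min(xs) - pad, 0), min(max(xs) + pad, width - 1)
    y0, y1 = max(min(ys) - pad, 0), min(max(ys) + pad, height - 1)
    return [[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0
             for x in range(width)]
            for y in range(height)]


def fuse_by_mask(background: Grid,
                 per_character: Sequence[Grid],
                 masks: Sequence[Mask]) -> Grid:
    """At each position, take a character's value where its mask is set,
    otherwise keep the background value. Where masks overlap, later
    characters overwrite earlier ones (an arbitrary choice for this sketch)."""
    fused = [row[:] for row in background]
    for values, mask in zip(per_character, masks):
        for y, mask_row in enumerate(mask):
            for x, inside in enumerate(mask_row):
                if inside:
                    fused[y][x] = values[y][x]
    return fused


# Toy 4x6 "frame" with two characters standing side by side.
height, width = 4, 6
mask_a = pose_to_mask([(1, 1), (1, 2)], height, width, pad=0)
mask_b = pose_to_mask([(4, 1), (4, 2)], height, width, pad=0)

# Placeholders for the outputs produced under each prompt.
background = [["bg"] * width for _ in range(height)]
char_a = [["A"] * width for _ in range(height)]
char_b = [["B"] * width for _ in range(height)]

fused = fuse_by_mask(background, [char_a, char_b], [mask_a, mask_b])
for row in fused:
    print(" ".join(f"{cell:>2}" for cell in row))
```

In this toy layout each character's placeholder output appears only inside its own mask, and everything else falls back to the background. That is the intuition behind keeping per-character prompts from leaking into each other's regions.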

Paper Structure

This paper contains 13 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The visualization of multi-character generation of our method.
  • Figure 2: Framework of Follow-Your-MultiPose.
  • Figure 3: Details of each module, including the prompt filtering module, mask enhancement flow, and spatial-aligned cross-attention.
  • Figure 4: The visualization of feature maps based on different ControlNet types.
  • Figure 5: The illustration of generating videos of different approaches.
  • ...and 1 more figure