Fleximo: Towards Flexible Text-to-Human Motion Video Generation

Yuhang Zhang, Yuan Zhou, Zeyu Liu, Yuxuan Cai, Qiuyue Wang, Aidong Men, Huan Yang

TL;DR

Fleximo generates high-quality human motion videos from a reference image and natural language, circumventing the need for large text-video training data. It combines anchor-point-based rescaling, a skeleton adapter for detailed hand/face motion, LLM-driven planning for long sequences, and a refinement step to produce coherent, identity-consistent videos. The authors introduce MotionBench and MotionScore to benchmark and quantify motion-text alignment, and report that Fleximo outperforms existing text-conditioned image-to-video baselines on both visual quality and motion fidelity. By replacing guidance videos with text, the work makes human motion video generation more flexible and supplies standardized evaluation resources for future research.

Abstract

Current methods for generating human motion videos rely on extracting pose sequences from reference videos, which restricts flexibility and control. Additionally, due to the limitations of pose detection techniques, the extracted pose sequences can sometimes be inaccurate, leading to low-quality video outputs. We introduce a novel task aimed at generating human motion videos solely from reference images and natural language. This approach offers greater flexibility and ease of use, as text is more accessible than the desired guidance videos. However, training an end-to-end model for this task requires millions of high-quality text and human motion video pairs, which are challenging to obtain. To address this, we propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models. This approach is not straightforward, as the text-generated skeletons may not consistently match the scale of the reference image and may lack detailed information. To overcome these challenges, we introduce an anchor point based rescale method and design a skeleton adapter to fill in missing details and bridge the gap between text-to-motion and motion-to-video generation. We also propose a video refinement process to further enhance video quality. A large language model (LLM) is employed to decompose natural language into discrete motion sequences, enabling the generation of motion videos of any desired length. To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions. We also propose a new metric, MotionScore, to evaluate the accuracy of motion following. Both qualitative and quantitative results demonstrate that our method outperforms existing text-conditioned image-to-video generation methods. All code and model weights will be made publicly available.
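The anchor point based rescale mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything below is an assumption: the function name `rescale_to_anchor`, the choice of anchor joint, and the use of a single bone length to set the scale are all hypothetical and are not taken from the paper.

```python
import numpy as np

def rescale_to_anchor(skeleton, ref_pose, anchor_idx=0, bone=(0, 1)):
    """Align a generated 2D skeleton sequence to a reference pose.

    skeleton:   (T, J, 2) array of generated 2D joints over T frames.
    ref_pose:   (J, 2) array of joints detected in the reference image.
    anchor_idx: joint used as the anchor point (hypothetical choice).
    bone:       pair of joint indices whose distance sets the scale
                (hypothetical choice, not from the paper).
    """
    skeleton = np.asarray(skeleton, dtype=float)
    ref_pose = np.asarray(ref_pose, dtype=float)
    a, b = bone

    # Compare the same bone in the first generated frame and in the reference.
    gen_len = np.linalg.norm(skeleton[0, a] - skeleton[0, b])
    ref_len = np.linalg.norm(ref_pose[a] - ref_pose[b])
    if gen_len == 0:
        raise ValueError("degenerate bone in generated skeleton")
    scale = ref_len / gen_len

    # Scale every frame about the first-frame anchor, then move that
    # anchor onto the reference anchor so the sequence starts aligned.
    origin = skeleton[0, anchor_idx]
    return (skeleton - origin) * scale + ref_pose[anchor_idx]
```

Using one scale factor and one translation for the whole sequence keeps the motion itself unchanged; only its size and position are matched to the reference image.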

Paper Structure

This paper contains 22 sections, 6 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Given a reference image and a motion text, our method Fleximo can generate motion videos in which the reference identity performs the motions described in the motion text. The reference image is shown as the first frame, and the input motion text is at the top of the figure. Different colors mark the slices of different motion segments.
  • Figure 2: The framework of Fleximo. We use an LLM to plan long motion texts. The text-to-motion module generates 3D mesh vertices corresponding to the motion texts. These vertices are then projected into 2D space. The 2D skeleton points are scaled based on the anchor point and assembled into skeleton videos. These skeleton videos are input into the skeleton adapter for detail completion. The output skeleton video and the reference image are used as guidance for human motion video generation. A video refinement process can further improve the generated video quality.
  • Figure 3: The structure of our proposed skeleton adapter. CLIP (Radford et al., 2021) and the VAE (Kingma and Welling, 2013) are frozen during training, the PoseNet is trained from scratch, and the U-Net (Ronneberger et al., 2015) is fine-tuned from Stable Video Diffusion (Blattmann et al., 2023). During training, the reference image is sampled randomly from the training pose videos with hands, and the pose videos without hands are used for motion guidance. During inference, the reference image is detected from the given image, and the handless pose video is generated by the text-to-motion module.
  • Figure 4: Qualitative results of Fleximo (first three rows) compared to the SOTA text-conditioned image-to-video generation method, DynamiCrafter (last row), in the text-to-human motion video generation task.
  • Figure 5: The pose video generated by the skeleton adapter (second row) given the handless pose video (first row). The skeleton adapter maintains the motion in the handless pose video while completing detailed and realistic hand information. Please zoom in for better visualization.
  • ...and 9 more figures
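
The Figure 2 caption says that 3D mesh vertices are projected into 2D space before rescaling. As a rough illustration only, the sketch below uses a standard pinhole camera model; the captions do not state which camera model Fleximo uses, and the function name `project_points` and the default `focal`, `center`, and `depth_offset` values are hypothetical.

```python
import numpy as np

def project_points(points_3d, focal=1000.0, center=(256.0, 256.0), depth_offset=5.0):
    """Pinhole projection of (..., 3) camera-space points to (..., 2) pixels.

    focal, center, and depth_offset are illustrative defaults, not values
    from the paper. depth_offset pushes the body in front of the camera.
    """
    pts = np.asarray(points_3d, dtype=float)
    z = pts[..., 2] + depth_offset
    if np.any(z <= 0):
        raise ValueError("points must lie in front of the camera")

    # Perspective divide, then shift to the image center.
    x = focal * pts[..., 0] / z + center[0]
    y = focal * pts[..., 1] / z + center[1]
    return np.stack([x, y], axis=-1)
```

Because the function broadcasts over leading dimensions, a whole motion of shape `(T, J, 3)` maps to `(T, J, 2)` in a single call, and the result can then be rescaled against the reference image.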