StickMotion: Generating 3D Human Motions by Drawing a Stickman

Tao Wang, Zhihua Wu, Qiaozhi He, Jiaming Chu, Ling Qian, Yu Cheng, Junliang Xing, Jian Zhao, Lei Jin

TL;DR

This work tackles the challenge of generating 3D human motions from text by introducing StickMotion, a diffusion-based framework that jointly leverages textual descriptions and stickman cues placed at the start, middle, and end of a motion sequence. A Stickman Generation Algorithm automatically creates stickman representations, while a Multi-Condition Module fuses text and stickman inputs efficiently during diffusion, supported by a Dynamic Supervision strategy that aligns stickman positions with natural motion. The authors also propose the StiSim metric to quantify stickman influence and report competitive results on KIT-ML and HumanML3D, along with a user study showing about 51.5% time savings for sketch-based specification. Overall, StickMotion enables more intuitive, user-friendly control of 3D motion generation with reduced computational cost and validated effectiveness.

Abstract

Text-to-motion generation, which translates textual descriptions into human motions, has struggled to accurately capture detailed user-imagined motions from simple text inputs. This paper introduces StickMotion, an efficient diffusion-based network designed for multi-condition scenarios, which generates desired motions based on traditional text and our proposed stickman conditions for global and local control of these motions, respectively. We address the challenges introduced by the user-friendly stickman from three perspectives: 1) Data generation. We develop an algorithm to generate hand-drawn stickmen automatically across different dataset formats. 2) Multi-condition fusion. We propose a multi-condition module that integrates into the diffusion process and obtains outputs of all possible condition combinations, reducing computational complexity and enhancing StickMotion's performance compared to conventional approaches with the self-attention module. 3) Dynamic supervision. We empower StickMotion to make minor adjustments to the stickman's position within the output sequences, generating more natural movements through our proposed dynamic supervision strategy. Quantitative experiments and user studies show that sketching stickmen saves users about 51.5% of the time needed to generate motions consistent with their imagination. Our code, demos, and relevant data will be released to facilitate further research and validation within the scientific community.

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Human motions generated by StickMotion conditioned on both stickmen and textual descriptions. The black number under each stickman denotes the index of the frame in the generated motion sequence at which the human pose is generated according to that stickman. S, M, and E denote the start, middle, and end of the motion sequence. The stickman figures shown were drawn by users.
  • Figure 2: Stickmen generated by the Stickman Generation Algorithm on the KIT-ML [plappert2016kit] and HumanML3D [guo2022generating] datasets.
  • Figure 3: The StickMotion framework, with the diffusion process on the left and the network structure on the right. 1) The diffusion process consists of a forward process and a reverse process. In the forward process, Gaussian noise is added to the original motions, and the noisy motions are fed into StickMotion, which learns to predict the added noise conditioned on text from the dataset and on stickmen generated from the actual motion by the Stickman Generation Algorithm (SGA). In the reverse process, the user's textual descriptions and stickman figures are fed into StickMotion, which gradually generates motion sequences using its predicted noise. 2) In the network structure, the stickman encoder and the text encoder remain frozen while the other components are trained. The encoded inputs pass through multiple Multi-Condition Modules (MCM) to produce noise predictions, which are then used to generate motion sequences during the reverse process.
  • Figure 4: Visualization with various input combinations.
  • Figure 5: Comparison between overall description & stickman for StickMotion (above) and detailed description for ReMoDiffuse (below).
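
The forward process described in the Figure 3 caption (adding Gaussian noise to an original motion so the network can learn to predict that noise) can be illustrated with a minimal sketch. It assumes the standard DDPM closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with a linear noise schedule over 1000 steps; neither the schedule nor the step count is stated here, and the function names and the toy motion are hypothetical. The denoising network and its text and stickman conditions are not modeled.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not taken from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] is the product of (1 - beta[s]) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(motion, t, alpha_bars, rng):
    """Noise a motion (list of frames, each a list of floats) to diffusion step t.

    Returns the noisy motion and the Gaussian noise that was added; the noise
    is what a noise-prediction network would be trained to recover.
    """
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in motion]
    noisy = [
        [signal_scale * x + noise_scale * e for x, e in zip(frame, eps)]
        for frame, eps in zip(motion, noise)
    ]
    return noisy, noise

rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
motion = [[0.1 * f, -0.1 * f, 1.0] for f in range(4)]  # 4 toy frames, 3 features each
noisy, noise = add_noise(motion, t=500, alpha_bars=alpha_bars, rng=rng)
```

In a training loop, `noisy` and the step index would be passed to the network together with the conditions, and the loss would compare the network's output against `noise`.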