SMooGPT: Stylized Motion Generation using Large Language Models

Lei Zhong, Yi Yang, Changjian Li

TL;DR

This work addresses stylized motion generation by reframing it as a reasoning–composition–generation problem in a body-part textual space, enabling fine-grained and interpretable control over motion content and style. A fine-tuned LLM, SMooGPT, serves as the motion reasoner, composer, and generator: it translates content and style into body-part texts and reconciles conflicts between them before diffusion-based refinement decodes the texts into motions. Key contributions include the body-part space representation, pre- and post-training to align the text and motion modalities, a diffusion head for cross-part coordination, and extensive evaluation on HumanML3D and 100STYLE with both text-guided and motion-guided stylization. The reported results show superior performance and generalization to unseen styles, supported by a user study. The approach offers a principled, interpretable pathway to flexible and scalable stylized motion generation with practical impact for animation, game design, and robotics, leveraging open-vocabulary style descriptions via LLMs.

Abstract

Stylized motion generation is an active research topic in computer graphics and has benefited in particular from the rapid advances in diffusion models. The goal of this task is to produce a novel motion respecting both the motion content and the desired motion style, e.g., "walking in a loop like a monkey". Existing research attempts to address this problem via motion style transfer or conditional motion generation. These methods typically embed the motion style into a latent space and also guide the motion implicitly in a latent space. Despite the progress, they suffer from low interpretability and control and from limited generalization to new styles, and they fail to produce motions other than "walking" due to the strong bias in the public stylization dataset. In this paper, we propose to solve the stylized motion generation problem from a new perspective of reasoning-composition-generation, based on our observations: i) human motion can often be effectively described using natural language in a body-part centric manner, ii) LLMs exhibit a strong ability to understand and reason about human motion, and iii) human motion is inherently compositional, which allows new motion content or styles to be generated through effective recomposition. We thus propose utilizing the body-part text space as an intermediate representation, and present SMooGPT, a fine-tuned LLM, acting as a reasoner, composer, and generator when generating the desired stylized motion. Our method operates in the body-part text space with much higher interpretability, enables fine-grained motion control, effectively resolves potential conflicts between motion content and style, and generalizes well to new styles thanks to the open-vocabulary ability of LLMs. Comprehensive experiments and evaluations, together with a user perceptual study, demonstrate the effectiveness of our approach, especially for purely text-driven stylized motion generation.
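To make the idea of composing in a body-part text space more concrete, the following is a minimal sketch, not the authors' implementation. In the paper, a fine-tuned LLM does the reasoning and composing; here a toy hand-written rule stands in for it. The part names, the `BodyPartText` class, the `compose` function, and the `content_locked` parameter are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical partition of the body; the paper's actual part set may differ.
BODY_PARTS: Tuple[str, ...] = ("head", "torso", "left arm", "right arm", "legs")


@dataclass
class BodyPartText:
    """A global description plus optional per-part descriptions (illustrative only)."""
    global_text: str
    parts: Dict[str, str] = field(default_factory=dict)


def compose(content: BodyPartText,
            style: BodyPartText,
            content_locked: Tuple[str, ...] = ("legs",)) -> BodyPartText:
    """Toy rule-based stand-in for the LLM composer.

    For each part, the style text wins unless the part is locked to the
    content (e.g. the legs carry the walking trajectory). This is how the
    sketch resolves a content/style conflict on the same body part.
    """
    merged: Dict[str, str] = {}
    for part in BODY_PARTS:
        c: Optional[str] = content.parts.get(part)
        s: Optional[str] = style.parts.get(part)
        if part in content_locked and c is not None:
            chosen = c          # content takes priority on locked parts
        else:
            chosen = s if s is not None else c
        if chosen is not None:  # parts nobody described are left out
            merged[part] = chosen
    return BodyPartText(f"{content.global_text}, {style.global_text}", merged)


content = BodyPartText(
    "walking in a loop",
    {"legs": "step forward along a circular path",
     "left arm": "swings naturally",
     "right arm": "swings naturally"},
)
style = BodyPartText(
    "like a monkey",
    {"torso": "hunched forward",
     "left arm": "hangs low and swings loosely",
     "right arm": "hangs low and swings loosely",
     "legs": "knees bent deeply"},
)

result = compose(content, style)
for part, text in result.parts.items():
    print(f"{part}: {text}")
```

Because every intermediate value is plain text keyed by body part, the composed result can be read and edited by hand before it is decoded into motion, which is the interpretability argument the abstract makes. The fixed priority rule is the part that the LLM replaces with actual reasoning.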

Paper Structure

This paper contains 55 sections, 8 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Motion, Body Parts, and Natural Language. Human motion is a coherent composition of body-part movements that can be described well using natural language. (a) A style motion can be naturally characterized by a combination of body-part styles, and similarly, (b) arbitrary content motion can be depicted by the same set of body parts using natural language. (c) LLMs have the intrinsic power to understand human motion and produce part-based descriptions in line with human interpretation in the language domain (see the highlighted common color).
  • Figure 2: Body Part Space. Our body-part space consists of three main elements: the global text, the body-part texts, and the corresponding 4D motion. By changing the left-arm text from "backward" to "forward", a novel stylized motion can be obtained (i.e., "Superhero"), due to the compositional nature of the space (bottom-left).
  • Figure 3: Stylized Motion Generation. (a) Our body-part centric stylized motion generation methodology adopts a reasoning, composing, and generation framework, where the given motion content and style (either texts or motion sequences) are translated and composed in the body-part space, and the resulting body-part texts are further translated back to the motion space, producing the stylized motion. (b) We build SMooGPT by fine-tuning an LLM with body-part-based tokenization (top) and pre- and post-training stages (top), so that it operates within our methodology.
  • Figure 4: Result Gallery. Using SMooGPT, we have generated a diverse set of stylized motions guided by various combinations of motion content and style texts. To highlight our method's capabilities, we focus on specialized pure text-based inputs. Each example presents the content and style descriptions, the resulting motion depicted through selected frames, and the corresponding simplified body-part texts. The full body-part texts are available in the supplementary.
  • Figure 5: Visual Comparison. Typical examples comparing our method with competitors on the text-guided and motion-guided stylization tasks. Note the imperfections in either the motion content or the style of the motions generated by competitors.
  • ...and 4 more figures