MoChat: Joints-Grouped Spatio-Temporal Grounding LLM for Multi-Turn Motion Comprehension and Description

Jiawei Mo, Yixuan Chen, Rifen Lin, Yongkang Ni, Min Zeng, Xiping Hu, Min Li

TL;DR

MoChat is a multimodal large language model capable of spatio-temporal grounding of human motion and of understanding multi-turn dialogue context. The authors present it as the first model capable of fine-grained spatio-temporal grounding of human motion.

Abstract

Despite continuous advancements in deep learning for understanding human motion, existing models often struggle to accurately identify action timing and specific body parts, and typically support only single-round interaction. Such limitations in capturing fine-grained motion details reduce their effectiveness in motion understanding tasks. In this paper, we propose MoChat, a multimodal large language model capable of spatio-temporal grounding of human motion and understanding multi-turn dialogue context. To achieve these capabilities, we group the spatial information of each skeleton frame based on human anatomical structure and then process the groups with a Joints-Grouped Skeleton Encoder, whose outputs are combined with LLM embeddings to create spatio-aware and temporal-aware embeddings separately. Additionally, we develop a pipeline for extracting timestamps from skeleton sequences based on textual annotations, and construct multi-turn dialogues for spatial grounding. Finally, various task instructions are generated for joint training. Experimental results demonstrate that MoChat achieves state-of-the-art performance across multiple metrics in motion understanding tasks, making it the first model capable of fine-grained spatio-temporal grounding of human motion.
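
The anatomical grouping step mentioned in the abstract can be illustrated with a minimal sketch. The 22-joint layout, the joint indices, and the five group names below are hypothetical, chosen only for illustration; the abstract does not specify the paper's actual grouping, and the encoder that consumes the groups is not shown.

```python
# Hypothetical 22-joint layout: the indices and group names are illustrative
# assumptions, not the grouping defined in the paper.
JOINT_GROUPS = {
    "torso": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}


def group_frame(frame, groups=JOINT_GROUPS):
    """Split one skeleton frame (a list of (x, y, z) joints) into per-part lists."""
    return {name: [frame[i] for i in idxs] for name, idxs in groups.items()}


def group_sequence(sequence, groups=JOINT_GROUPS):
    """Apply group_frame to every frame of a motion sequence."""
    return [group_frame(frame, groups) for frame in sequence]


# Toy sequence: 2 frames, 22 joints; joint j in frame t sits at (t, j, 0.0).
toy_sequence = [[(float(t), float(j), 0.0) for j in range(22)] for t in range(2)]
grouped = group_sequence(toy_sequence)
print(sorted(grouped[0]))            # part names
print(len(grouped[0]["left_arm"]))   # number of joints in the left-arm group
```

Each body part then has its own short list of joint coordinates per frame, which is the kind of input a per-group embedding could operate on before the groups are combined.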

Paper Structure

This paper contains 31 sections, 7 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Illustration of the multi-turn spatio-temporal grounding capabilities of MoChat. MoChat is a large language model designed for motion comprehension, with capabilities that extend beyond regular motion description. Specifically, MoChat can follow user instructions to summarize motion sequences (Turn I), pinpoint specific body parts involved in the motion (Turn II), and ground the start and end frames corresponding to user queries (Turn III).
  • Figure 2: Overview of MoChat. Given a skeleton motion sequence as input, (a) the Joints-Grouped Skeleton Encoder first extracts motion features by grouping and embedding the joints separately. Then, (b) the Projector converts these features into motion tokens $H_s$ in the language latent space. These motion tokens $H_s$ are concatenated with instruction tokens $H_t$ and fed into (c) a Large Language Model (LLM). The LLM's final hidden states $H_m$ are decoded into appropriate responses and passed to (d) a Regression Head to obtain the corresponding timestamps.
  • Figure 3: Dialogue Examples. Q represents the human instruction, and A represents the ground truth answer. Only a subset of the templates is shown here; the complete set can be found in the supplementary material.
  • Figure 4: Ablation study of the Spatial Limb Grounding task across different models and instruction sets. The module names GLTE, JGSE, and RH refer to Global-Local Transformer Encoder, Joints-Grouped Skeleton Encoder, and Regression Head, respectively. BMUD+SD+TGD refers to a model jointly trained on Basic Motion Understanding Dialogue, Spatial Dialogue, and Temporal Grounding Dialogue.
  • Figure 5: Pipeline for constructing Temporal Grounding Dialogues. GLM-4 splits the caption into atomic actions and identifies the corresponding most significant joint and coordinate. The curves represent the coordinates of the selected joint, with the numbers on the curves indicating the frame IDs of the extremum points. We construct multi-turn temporal grounding dialogues based on the final extracted results.
  • ...and 3 more figures
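
The Figure 5 caption describes reading frame IDs off the extremum points of a selected joint's coordinate curve. A minimal sketch of that idea follows. The function names, the toy curve, and the pairing of consecutive extrema into candidate spans are assumptions made for illustration; the caption does not say how the paper's pipeline handles smoothing, plateaus, or the mapping from extrema to final timestamps.

```python
def local_extrema(values):
    """Return (frame_id, kind) for each strict local minimum or maximum."""
    extrema = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if cur > prev and cur > nxt:
            extrema.append((i, "max"))
        elif cur < prev and cur < nxt:
            extrema.append((i, "min"))
    return extrema


def segments_between_extrema(values):
    """Pair consecutive extrema into candidate (start_frame, end_frame) spans.

    This pairing rule is an assumption for illustration only.
    """
    frames = [i for i, _ in local_extrema(values)]
    return list(zip(frames, frames[1:]))


# Toy curve: one coordinate of a single joint over 9 frames (made-up numbers).
wrist_y = [0.0, 0.2, 0.6, 0.9, 0.5, 0.1, 0.4, 0.8, 0.3]
print(local_extrema(wrist_y))
print(segments_between_extrema(wrist_y))
```

On the toy curve, the extrema fall at frames 3, 5, and 7, giving two candidate spans. Real skeleton data is noisy, so a practical version would likely smooth the curve or apply a prominence threshold before picking extrema.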