T2MBench: A Benchmark for Out-of-Distribution Text-to-Motion Generation

Bin Yang, Rong Ou, Weisheng Xu, Jiaqi Xiong, Xintao Li, Taowen Wang, Luyu Zhu, Xu Jiang, Jing Tan, Renjing Xu

TL;DR

T2MBench is an out-of-distribution (OOD) benchmark for text-to-motion generation that pairs a 1,025-prompt OOD dataset with a unified, multi-dimensional evaluation framework. It evaluates 14 representative baselines using LLM-based judgments, multi-factor motion metrics, and fine-grained accuracy tests. The baselines show strengths in semantic alignment and physical plausibility but notable gaps in fine-grained control and long-horizon generalization. The work releases two datasets derived from the evaluation results to support benchmarking and reproducible research, and provides standardized prompt design and evaluation protocols to guide future model development. Overall, the findings point to improved fine-grained accuracy and cross-domain generalization as prerequisites for production-ready text-to-motion systems for animation and robotics.

Abstract

Most existing evaluations of text-to-motion generation focus on in-distribution textual inputs and a limited set of evaluation criteria, which restricts their ability to systematically assess model generalization and motion generation capabilities under complex out-of-distribution (OOD) textual conditions. To address this limitation, we propose a benchmark specifically designed for OOD text-to-motion evaluation, which includes a comprehensive analysis of 14 representative baseline models and two datasets derived from the evaluation results. Specifically, we construct an OOD prompt dataset consisting of 1,025 textual descriptions. Based on this prompt dataset, we introduce a unified evaluation framework that integrates LLM-Based Evaluation, Multi-Factor Motion Evaluation, and Fine-Grained Accuracy Evaluation. Our experimental results reveal that while different baseline models demonstrate strengths in areas such as text-to-motion semantic alignment, motion generalizability, and physical quality, most models struggle to achieve strong performance on Fine-Grained Accuracy Evaluation. These findings highlight the limitations of existing methods in OOD scenarios and offer practical guidance for the design and evaluation of future production-level text-to-motion models.

Paper Structure (43 sections, 28 equations, 13 figures, 20 tables)


Figures (13)

  • Figure 1: Overall pipeline of T2MBench.
  • Figure 2: Comparison of t-SNE results between our OOD prompt dataset and the HumanML3D text dataset.
  • Figure 3: The LLM-based evaluation pipeline.
  • Figure 4: LLM-Based Evaluation radar charts by evaluation dimensions.
  • Figure 5: Multi-Factor Motion Evaluation radar charts by metrics.
  • ...and 8 more figures