Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Wanjiang Weng, Xiaofeng Tan, Xiangbo Shu, Guo-Sen Xie, Pan Zhou, Hongsong Wang

Abstract

Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 (vs. 0.169) and an R@3 of 82.8\% (vs. 80.8\%), significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D. These results underscore the necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at \href{https://wengwanjiang.github.io/BilingualT2M-page}{https://wengwanjiang.github.io/BilingualT2M-page}.

Paper Structure

This paper contains 28 sections, 7 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Code-switching in bilingual text-to-motion generation. Existing text-to-motion methods (e.g., MLD Chen2023 and MotionLCM Dai2025) struggle to interpret mixed-language inputs, leading to incorrect motion semantics, while our BiMD generates semantically consistent motions.
  • Figure 2: Visual comparison of bilingual text-to-motion generation. Existing methods such as MDM Tevet2023 and MLD Chen2023 exhibit limitations in bilingual processing. Directly using pretrained multilingual encoders results in imbalanced performance across languages. Our BiMD with Cross-Lingual Alignment (CLA) generates accurate motions from both English and Chinese descriptions.
  • Figure 3: Pipeline for constructing our bilingual HumanML3D dataset. The data collection and filtering process removes unsuitable motions, ensuring high-quality motion-text pairs for annotation. The annotation pipeline begins with an initial translation stage, followed by a refinement stage to address translation issues. Finally, human annotators manually verify and correct the LLM-generated translations, ensuring linguistic and contextual accuracy.
  • Figure 4: Framework for training the bilingual motion diffusion model. We align English and Chinese text embeddings in a shared latent space by freezing the teacher model $E^t_{\Phi}$ and fine-tuning the student model $E^s_{\phi}$ with the cross-lingual alignment loss $\mathcal{L}_{\text{CLA}}$ in Eq. \ref{eq:bikl}. The aligned student model $E^s_{\phi}$ then provides text conditions for training the diffusion model $\epsilon_\theta$, enabling bilingual motion generation while minimizing $\mathcal{L}_{\text{BiMD}}$ in Eq. \ref{eq:mld_loss}. A minimal code sketch of the alignment step follows this list.
  • Figure 5: Visual comparison of zero-shot cross-lingual motion generation, transferring from Chinese-only training to unseen English texts. BiMD with CLA generates accurate, semantically aligned motions despite never having seen English during training.
  • ...and 2 more figures
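
The Figure 4 caption summarizes the training recipe: a frozen teacher text encoder defines a reference embedding space, a student encoder is fine-tuned so that English and Chinese captions map to aligned embeddings, and the aligned student then conditions the motion diffusion model. The sketch below is a minimal, hypothetical PyTorch illustration of that cross-lingual alignment step, not the paper's implementation: the toy TextEncoder, the MSE objective, and all hyperparameters are placeholders standing in for the actual encoders and $\mathcal{L}_{\text{CLA}}$ (Eq. \ref{eq:bikl}).

```python
# Hypothetical sketch of the Cross-Lingual Alignment (CLA) step from Figure 4.
# The teacher encoder E^t_Phi is frozen; the student encoder E^s_phi is fine-tuned
# so its embeddings of English and Chinese captions match the teacher's embedding.
# MSE is an illustrative placeholder for the paper's L_CLA, which is not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TextEncoder(nn.Module):
    """Toy stand-in encoder; the paper would use pretrained language models."""
    def __init__(self, vocab_size=30000, dim=512):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):          # token_ids: (batch, seq_len) of ids
        return self.proj(self.embed(token_ids))        # (batch, dim)


teacher = TextEncoder()                     # E^t_Phi: frozen reference space
student = TextEncoder()                     # E^s_phi: fine-tuned multilingual encoder
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)


def cla_loss(en_tokens, zh_tokens):
    """Pull student embeddings of both languages toward the frozen teacher's
    embedding of the English caption (placeholder objective)."""
    with torch.no_grad():
        target = teacher(en_tokens)
    loss_en = F.mse_loss(student(en_tokens), target)
    loss_zh = F.mse_loss(student(zh_tokens), target)
    return loss_en + loss_zh


# One toy optimization step on random token ids standing in for paired EN/ZH captions.
en = torch.randint(0, 30000, (8, 16))
zh = torch.randint(0, 30000, (8, 16))
loss = cla_loss(en, zh)
loss.backward()
optimizer.step()
# The aligned student would then provide the text condition for the diffusion
# model epsilon_theta (not shown), trained by minimizing L_BiMD.
```

Per the caption, the aligned student embeddings replace a monolingual text condition, so the denoiser $\epsilon_\theta$ operates over the same conditional space regardless of whether the input description is English or Chinese.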