
Empathetic Motion Generation for Humanoid Educational Robots via Reasoning-Guided Vision–Language–Motion Diffusion Architecture

Fuze Sun, Lingyu Li, Lekan Dai, Xinyu Fan

Abstract

This article proposes a reasoning-guided vision-language-motion diffusion framework (RG-VLMD) for generating instruction-aware co-speech gestures for humanoid robots in educational scenarios. The system integrates multimodal affective estimation, pedagogical reasoning, and teaching-act-conditioned motion synthesis to enable adaptive and semantically consistent robot behavior. A gated mixture-of-experts model predicts valence and arousal from text, visual, and acoustic input features, which are then mapped to discrete teaching-act categories through an affect-driven policy. These signals condition a diffusion-based motion generator using clip-level intent and frame-level instructional schedules via additive latent restriction with auxiliary action-group supervision. Compared with a baseline diffusion model, the proposed method produces more structured and distinctive motion patterns, as verified by motion statistics and pairwise distance analysis. The generated motion sequences remain physically plausible and can be retargeted to a NAO robot for real-time execution. The results indicate that reasoning-guided instructional conditioning improves gesture controllability and pedagogical expressiveness in educational human-robot interaction.
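The affect-to-act pipeline summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the gate logits, and the valence/arousal thresholds are all hypothetical, and the simple rule-based policy stands in for the paper's reasoning module. The act labels (explain, praise, challenge, neutral) are taken from the figure captions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_valence_arousal(expert_preds, gate_logits):
    """Fuse per-modality (valence, arousal) predictions.

    expert_preds: one (valence, arousal) pair per modality expert
                  (e.g. text, visual, acoustic).
    gate_logits:  one reliability score per modality (hypothetical inputs;
                  a trained gate would produce these).
    """
    weights = softmax(gate_logits)
    valence = sum(w * p[0] for w, p in zip(weights, expert_preds))
    arousal = sum(w * p[1] for w, p in zip(weights, expert_preds))
    return valence, arousal


def select_teaching_act(valence, arousal):
    """Map fused affect to a discrete teaching act.

    The thresholds below are illustrative assumptions, not values
    reported in the paper.
    """
    if valence < -0.2:
        return "explain"      # negative affect: slow down and explain
    if valence > 0.3 and arousal > 0.3:
        return "praise"       # positive and engaged
    if valence > 0.3:
        return "challenge"    # positive but low arousal
    return "neutral"


if __name__ == "__main__":
    preds = [(0.6, 0.5), (0.4, 0.3), (0.5, 0.4)]  # text, visual, acoustic
    v, a = fuse_valence_arousal(preds, gate_logits=[0.0, 0.0, 0.0])
    print(v, a, select_teaching_act(v, a))
```

With equal gate logits the fusion reduces to a plain average of the experts; a larger logit for one modality shifts the estimate toward that expert's prediction.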

Paper Structure (27 sections, 24 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Training and inference flow of the proposed valence/arousal estimator. During training, modality-specific XGBoost experts are first optimized on CMU-MOSEI using Huber-style regression objectives, and their outputs are then used to train a softmax-based reliability gate with fused mean squared error loss. During inference, student text, visual, and acoustic features are processed by the trained modality experts, followed by reliability-aware gated fusion and calibration to produce final valence and arousal estimates for downstream teaching-act selection.
  • Figure 2: RAPID-Motion diffusion framework for pedagogical co-speech gesture generation. Student affect is estimated using a MOSEI-trained Valence–Arousal model and converted by an LLM reasoning module into a clip-level teaching-act vector $u$ and frame-level schedule $u_{1:T}$. Audio and text embeddings are used as multimodal conditioning signals, together with instructional vectors and an optional motion prefix. A transformer-based diffusion denoiser generates motion tokens using local attention and self-attention. Training minimizes diffusion reconstruction loss with an auxiliary act-classification loss to enforce pedagogical consistency.
  • Figure 3: Representative gesture poses generated by the proposed model under different teaching-act conditions in the simulation environment. The poses illustrate that the diffusion policy produces distinct motion styles corresponding to pedagogical intentions, including expressive gestures for explanation and praise, directive posture for challenge, and low-intensity motion for neutral interaction.
  • Figure 4: Comparison between ground-truth and predicted valence trajectories. The predicted curves closely follow the temporal trend of the real values. In Dataset 3, the negative dip and final increase are correctly reproduced. In Dataset 4, the model captures the overall positive tendency and major trend changes, although the predicted amplitudes are smoother than the ground-truth values. These results indicate that the model can track valence dynamics while slightly compressing extreme intensities.
  • Figure 5: Pairwise distance heatmaps of normalized motion statistics across teaching-act categories. The baseline model shows weak separation between several acts, while the proposed conditioning produces a more structured distribution, with clear separation for highly expressive acts such as explain.
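The pairwise distance analysis described for Figure 5 can be sketched as follows, under stated assumptions. The choice of z-score normalization, the Euclidean metric, and the example statistics are hypothetical; the paper's exact features and distance measure are not given in this excerpt.

```python
import math


def zscore_columns(rows):
    """Z-score each feature (column) across acts; constant columns map to 0."""
    n = len(rows)
    dims = len(rows[0])
    out = [[0.0] * dims for _ in rows]
    for j in range(dims):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        scale = std if std > 0 else 1.0  # avoid division by zero
        for i in range(n):
            out[i][j] = (rows[i][j] - mean) / scale
    return out


def pairwise_distances(stats_by_act):
    """Return (act names, Euclidean distance matrix) over normalized statistics.

    stats_by_act: dict mapping a teaching-act name to a list of motion
                  statistics (hypothetical features, e.g. mean joint speed).
    """
    acts = list(stats_by_act)
    normed = zscore_columns([stats_by_act[a] for a in acts])
    dist = [[math.dist(a, b) for b in normed] for a in normed]
    return acts, dist


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results from the paper.
    example = {"explain": [2.0, 1.0], "neutral": [0.0, 1.0]}
    acts, dist = pairwise_distances(example)
    for name, row in zip(acts, dist):
        print(name, row)
```

Larger off-diagonal entries indicate that two teaching acts yield more distinct motion statistics, which is the kind of separation the heatmaps are meant to visualize.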