Sensorimotor Attention and Language-based Regressions in Shared Latent Variables for Integrating Robot Motion Learning and LLM

Kanata Suzuki, Tetsuya Ogata

TL;DR

This work tackles the challenge of online adaptation when grounding language instructions in robot motion by linking a motion-learning model (SATrRNN) with a language model (RWKV) through a shared latent variable (SLV). During training, the system learns to map language and sensorimotor signals into a common latent space. During motion generation, an error-regression phase updates the SLV online from the prediction errors for sensorimotor attention points and for the language instruction, enabling adaptive motion generation without updating the LLM weights. Experiments in a Robosuite Panda setup on the Lift, Roll, and Stack tasks show strong position generalization and language generalization when error regression is applied, with substantial gains in success rate over no-regression baselines. The work also analyzes internal representations, showing how SLV trajectories organize by task and how attention evolves to align with motion goals, which supports the grounding capability of the approach. The method improves data efficiency and offers a path toward end-to-end, feedback-driven grounding of language in robot control in realistic settings.

Abstract

In recent years, many studies have combined large language models (LLMs) with robotics; however, most have not considered end-to-end feedback in the robot-motion generation phase. Because the predictions of deep neural networks inevitably contain errors, the trained model must be updated to match the real environment in order to generate robot motion adaptively. This study proposes an integration method that connects a robot-motion learning model and an LLM through shared latent variables. When generating robot motion, the proposed method updates the shared parameters based on prediction errors from both the sensorimotor attention points and the task language instructions given to the robot. This allows the model to search efficiently for latent parameters appropriate for the robot task. Through simulator experiments on multiple robot tasks, we demonstrated the effectiveness of the proposed method from two perspectives: position generalization and language instruction generalization.
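
The error-regression idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the two frozen linear maps `W_att` and `W_lang` are hypothetical stand-ins for the trained SATrRNN attention predictor and the RWKV language head, and `lang_weight`, `lr`, and `steps` are assumed hyperparameters. The sketch only shows the principle that the latent vector `z` is optimized against a combined attention-point MSE and language reconstruction (cross-entropy) error while all model weights stay fixed.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()


def regression_loss_and_grad(z, W_att, W_lang, att_target, token_id, lang_weight=1.0):
    """Combined loss and its gradient with respect to the latent z only."""
    # Attention term: MSE between predicted and observed attention points.
    att_err = W_att @ z - att_target
    att_loss = np.mean(att_err ** 2)
    att_grad = (2.0 / att_err.size) * (W_att.T @ att_err)

    # Language term: cross-entropy for reconstructing the instruction token.
    probs = softmax(W_lang @ z)
    lang_loss = -np.log(probs[token_id])
    one_hot = np.zeros_like(probs)
    one_hot[token_id] = 1.0
    lang_grad = W_lang.T @ (probs - one_hot)

    return att_loss + lang_weight * lang_loss, att_grad + lang_weight * lang_grad


def regress_latent(z0, W_att, W_lang, att_target, token_id, lr=0.02, steps=300):
    """Gradient descent on z; the weight matrices are never modified."""
    z = z0.copy()
    losses = []
    for _ in range(steps):
        loss, grad = regression_loss_and_grad(z, W_att, W_lang, att_target, token_id)
        losses.append(loss)
        z -= lr * grad
    return z, losses


# Toy setup (all sizes and values are illustrative assumptions).
rng = np.random.default_rng(0)
latent_dim, att_dim, vocab_size = 8, 4, 5
W_att = rng.normal(size=(att_dim, latent_dim))      # frozen "motion model" head
W_lang = rng.normal(size=(vocab_size, latent_dim))  # frozen "language model" head
W_att_before, W_lang_before = W_att.copy(), W_lang.copy()

z_true = rng.normal(size=latent_dim)
att_target = W_att @ z_true                  # observed attention points
token_id = int(np.argmax(W_lang @ z_true))   # observed instruction token

z_fit, losses = regress_latent(np.zeros(latent_dim), W_att, W_lang, att_target, token_id)
print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.4f}")
```

In this toy setting the combined loss is convex in `z`, so plain gradient descent with a small step size suffices; the actual models are nonlinear and recurrent, and the paper's optimizer and loss weighting may differ.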

Paper Structure (20 sections, 2 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Overview of this study. In the proposed method, latent variables related to robot tasks are updated based on prediction errors for instruction sentences and sensorimotor attention.
  • Figure 2: Overview of the proposed method, consisting of three modules: SATrRNN with mask predictor, RWKV, and shared latent variables.
  • Figure 3: Overview of the proposed error regression method. The SLV is optimized from the reconstruction error for language instruction and MSE between extracted attention points (blue circle marks) and predicted attention points (red cross marks).
  • Figure 4: Robot task setup in our experiments.
  • Figure 5: Examples of generated Lift, Roll, and Stack task sequences in case 2 (test position).
  • ...and 3 more figures