Sensorimotor Attention and Language-based Regressions in Shared Latent Variables for Integrating Robot Motion Learning and LLM
Kanata Suzuki, Tetsuya Ogata
TL;DR
This work tackles the challenge of online adaptation when grounding language instructions to robot motion by linking a motion-learning model (SATrRNN) with a language model (RWKV) through a shared latent variable (SLV). During training, the system learns to map language and sensorimotor signals into a common latent space. During a regression phase, it updates the SLV online based on prediction errors from sensorimotor attention points and language predictions, enabling adaptive motion generation without updating the LLM weights. Empirical results in a Robosuite Panda setup for the Lift, Roll, and Stack tasks show position generalization and language generalization when error regression is applied, with substantial gains in success rates compared to no-regression baselines. The work also analyzes internal representations, revealing how SLV trajectories organize by task and how attention mechanisms evolve to align with motion goals, supporting the grounding capability of the approach. The method improves data efficiency and offers a path toward end-to-end, feedback-driven grounding of language to robot control in realistic settings.
Abstract
In recent years, many studies have combined large language models (LLMs) with robotics; however, most have not considered end-to-end feedback in the robot-motion generation phase. Because the predictions of deep neural networks inevitably contain errors, the trained model must be updated to match the real environment in order to generate robot motion adaptively. This study proposes an integration method that connects a robot-motion learning model and an LLM through shared latent variables. When generating robot motion, the proposed method updates the shared parameters based on prediction errors from both the sensorimotor attention points and the task language instructions given to the robot. This allows the model to efficiently search for latent parameters appropriate to the robot task. Through simulator experiments on multiple robot tasks, we demonstrated the effectiveness of the proposed method from two perspectives: position generalization and language-instruction generalization.
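The error-regression idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two frozen linear maps `W_att` and `W_lang` are hypothetical stand-ins for the trained motion model and language model, and the function names, learning rate, and step count are assumptions chosen for illustration. Only the shared latent vector `z` is updated, by gradient descent on the summed squared prediction errors of both branches.

```python
def matvec(W, z):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]


def regress_latent(z, W_att, W_lang, att_target, lang_target, lr=0.1, steps=200):
    """Update only the shared latent z; W_att and W_lang stay frozen.

    Minimizes 0.5 * (|W_att z - att_target|^2 + |W_lang z - lang_target|^2).
    All names and hyperparameters here are illustrative assumptions.
    """
    z = list(z)
    for _ in range(steps):
        # Prediction errors of the two branches for the current latent.
        err_att = [p - t for p, t in zip(matvec(W_att, z), att_target)]
        err_lang = [p - t for p, t in zip(matvec(W_lang, z), lang_target)]
        # Gradient w.r.t. z: W_att^T err_att + W_lang^T err_lang.
        grad = [
            sum(W_att[i][j] * err_att[i] for i in range(len(err_att)))
            + sum(W_lang[i][j] * err_lang[i] for i in range(len(err_lang)))
            for j in range(len(z))
        ]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z


# Toy setup: the targets are generated from a "true" latent that the
# regression should recover, starting from a zero latent.
W_att = [[1.0, 0.0], [0.0, 1.0]]
W_lang = [[0.5, 0.0], [0.0, 0.5]]
z_true = [0.8, -0.3]
att_target = matvec(W_att, z_true)
lang_target = matvec(W_lang, z_true)

z_est = regress_latent([0.0, 0.0], W_att, W_lang, att_target, lang_target)
```

In this toy setting the objective is quadratic, so plain gradient descent converges to `z_true`. With nonlinear models, as in the proposed method, the same loop would need gradients obtained by backpropagation through the frozen networks.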
