Beyond Master and Apprentice: Grounding Foundation Models for Symbiotic Interactive Learning in a Shared Latent Space
Linus Nwankwo, Björn Ellensohn, Christian Rauch, Elmar Rueckert
TL;DR
The paper addresses the limitations of the master-apprentice model in language-conditioned HRI by proposing Symbiotic Interactive Learning (SIL), a bidirectional, co-adaptive framework that grounds human and agent beliefs in a shared latent task space. SIL integrates perception, language understanding, memory, and action, and uses uncertainty-aware parsing and memory safeguards (elastic weight consolidation, EWC) to support continual, mutual adaptation. It formalizes belief co-evolution, implements episodic and semantic memory, and reports improved task completion and belief alignment across five embodied-task domains in both simulation and the real world. The work targets robust, long-horizon human–robot collaboration through proactive clarification, personalized interaction, and persistent knowledge retention, with public demos and resources for replication.
Abstract
Today's autonomous agents can understand free-form natural language instructions and execute long-horizon tasks in a manner akin to human-level reasoning. These capabilities are mostly driven by large-scale pre-trained foundation models (FMs). However, the approaches used to ground these models for human-robot interaction (HRI) perpetuate a master-apprentice model, in which the apprentice (the embodied agent) passively receives and executes the commands of the master (the human) without reciprocal learning. This reactive mode of interaction does not capture the co-adaptive dynamics inherent in everyday multi-turn human-human interactions. To address this, we propose Symbiotic Interactive Learning (SIL), an approach that enables both the master and the apprentice to co-adapt through mutual, bidirectional interactions. We formalise SIL as a co-adaptation process within a shared latent task space, where the agent and the human maintain joint belief states that evolve based on the interaction history. This enables the agent to move beyond reactive execution to proactive clarification, adaptive suggestions, and shared plan refinement. To realise these behaviours, we leverage pre-trained FMs for spatial perception and reasoning, alongside a lightweight latent encoder that grounds the models' outputs into task-specific representations. Furthermore, to ensure stability as the tasks evolve, we augment SIL with a memory architecture that prevents the forgetting of learned task-space representations. We validate SIL on both simulated and real-world embodied tasks, including instruction following, information retrieval, query-oriented reasoning, and interactive dialogues. Demos and resources are publicly available at https://linusnep.github.io/SIL/.
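The TL;DR names EWC as the safeguard against forgetting. As a rough illustration only, the standard EWC penalty adds a quadratic cost for moving parameters away from previously learned values, weighted by a per-parameter importance estimate (the diagonal of the Fisher information). The sketch below is not taken from the paper: the function names `ewc_penalty` and `total_loss`, the flat-list parameter layout, and the `strength` coefficient are assumptions made for this example, and SIL's actual memory architecture may differ.

```python
def ewc_penalty(params, anchor_params, fisher_diag, strength):
    """Generic EWC penalty: (strength / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    params        -- current parameter values (hypothetical flat list)
    anchor_params -- values retained from earlier tasks
    fisher_diag   -- per-parameter importance (diagonal Fisher estimate)
    strength      -- weight on retention versus new learning (assumed name)
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * strength * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def total_loss(task_loss, params, anchor_params, fisher_diag, strength=1.0):
    """New-task loss plus the EWC penalty that discourages forgetting."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, strength)


# Moving an important parameter (importance 5.0) costs more than
# moving an unimportant one (importance 1.0) by the same amount.
anchor = [0.0, 0.0]
fisher = [1.0, 5.0]
cost_unimportant = ewc_penalty([1.0, 0.0], anchor, fisher, strength=1.0)
cost_important = ewc_penalty([0.0, 1.0], anchor, fisher, strength=1.0)
```

In this toy setting, the same unit shift is penalised five times more heavily on the second parameter, which is the mechanism by which EWC protects representations that mattered for earlier tasks.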
