Beyond Master and Apprentice: Grounding Foundation Models for Symbiotic Interactive Learning in a Shared Latent Space

Linus Nwankwo, Björn Ellensohn, Christian Rauch, Elmar Rueckert

TL;DR

The paper addresses the master-apprentice limitations in language-conditioned HRI by proposing Symbiotic Interactive Learning (SIL), a bidirectional, co-adaptive framework that grounds human and agent beliefs in a shared latent task space. SIL integrates perception, language understanding, memory, and action with uncertainty-aware parsing and memory safeguards (EWC) to support continual, mutual adaptation. It formalizes belief co-evolution, implements episodic and semantic memory, and demonstrates improved task completion and belief alignment across five embodied-task domains in both simulation and the real world. The work advances robust, long-horizon human–robot collaboration by enabling proactive clarification, personalized interaction, and persistent knowledge retention, with public demos and resources for replication.

Abstract

Today's autonomous agents can understand free-form natural language instructions and execute long-horizon tasks in a manner akin to human-level reasoning. These capabilities are mostly driven by large-scale pre-trained foundation models (FMs). However, the approaches with which these models are grounded for human-robot interaction (HRI) perpetuate a master-apprentice model, where the apprentice (embodied agent) passively receives and executes the master's (human's) commands without reciprocal learning. This reactive interaction approach does not capture the co-adaptive dynamics inherent in everyday multi-turn human-human interactions. To address this, we propose a Symbiotic Interactive Learning (SIL) approach that enables both the master and the apprentice to co-adapt through mutual, bidirectional interactions. We formalised SIL as a co-adaptation process within a shared latent task space, where the agent and human maintain joint belief states that evolve based on interaction history. This enables the agent to move beyond reactive execution to proactive clarification, adaptive suggestions, and shared plan refinement. To realise these novel behaviours, we leveraged pre-trained FMs for spatial perception and reasoning, alongside a lightweight latent encoder that grounds the models' outputs into task-specific representations. Furthermore, to ensure stability as the tasks evolve, we augment SIL with a memory architecture that prevents the forgetting of learned task-space representations. We validate SIL on both simulated and real-world embodied tasks, including instruction following, information retrieval, query-oriented reasoning, and interactive dialogues. Demos and resources are publicly available at https://linusnep.github.io/SIL/.

Paper Structure

This paper contains 22 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (a) The traditional master-apprentice model places the entire reasoning burden on the user (e.g., context, memory), requiring precise and unambiguous instructions for passive execution. In contrast, SIL enables co-adaptive interaction, in which both participants iteratively update their shared latent beliefs to reduce ambiguity and cognitive load. (b) An example of SIL's contextual grounding: upon receiving an ambiguous instruction, the agent triggers a clarification dialogue. It offers candidate interpretations based on prior interactions, resolves the intent, and executes the navigation task (yellow path).
  • Figure 2: Overview of SIL's architecture. Human instructions are received through the natural-language interaction interface (A) and passed to the LLM ensemble for intent parsing (B & C). Internally, the agent maintains belief states in a shared latent task space. These belief states are updated through co-adaptation dynamics and aligned via cosine similarity (D, E, & F). Visual grounding is achieved through pre-trained vision–language models that segment and project objects into 3D coordinates (G). Action plans are executed through the action executor (H), which provides feedback in the form of progress updates, error reporting, and adaptive suggestions. The memory architecture ensures continual adaptation over time.
  • Figure 3: Belief alignment ($\rho$) across multi-turn interactions. Full SIL (blue) exhibits rapid convergence toward a stable equilibrium $\rho \approx 0.83$, maintaining high alignment throughout. In contrast, ablations without co-adaptation, EWC, human preference modelling, memory, or uncertainty handling exhibit unstable trajectories ($\rho \approx 0.52$ to $0.65$) and fail to achieve strong alignment.
  • Figure 4: Task success rate across domains and ablated variants. Full SIL consistently outperforms all ablations, achieving near-ceiling performance on LPL, MIIR, and PDS. The worst performance arises when co-adaptation and EWC are disabled, confirming their critical role. Memory, human preference modelling, and uncertainty contribute smaller but significant improvements, particularly in context-heavy and personalisation-sensitive tasks. Error bars show standard deviation across trials.
  • Figure 5: Qualitative examples of SIL in multi-turn interaction tasks. Yellow paths indicate the agent’s navigation trajectories, starting from the origin $(x=0,y=0,z=0)$. (a) The user issues a conditional navigation command requiring logical reasoning over spatial constraints; SIL computes the round-trip time and executes the correct policy. (b) The user probes anti-forgetting; SIL recalls and reproduces a previously executed navigation sequence, showing stable task memory. (c) The user teaches a new preference ("repeat previous task" implies returning to the origin and drawing a circle). SIL encodes this personalisation and applies it correctly in subsequent interactions, demonstrating preference retention and continual learning.
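
The captions for Figures 2 and 3 describe belief states that are aligned via cosine similarity and summarised by an alignment score $\rho$. The following is a minimal sketch of how such a score could be computed, assuming $\rho$ is the plain cosine similarity between a human belief vector and an agent belief vector in the shared latent space. The function names, the example vectors, and the clarification threshold are hypothetical illustrations, not the authors' implementation.

```python
import math
from typing import Sequence


def cosine_alignment(human_belief: Sequence[float],
                     agent_belief: Sequence[float]) -> float:
    """Cosine similarity between two belief vectors.

    Assumption: both beliefs are fixed-length latent vectors of equal
    dimension. Returns 0.0 if either vector has zero norm.
    """
    if len(human_belief) != len(agent_belief):
        raise ValueError("belief vectors must have the same dimension")
    dot = sum(h * a for h, a in zip(human_belief, agent_belief))
    norm_h = math.sqrt(sum(h * h for h in human_belief))
    norm_a = math.sqrt(sum(a * a for a in agent_belief))
    if norm_h == 0.0 or norm_a == 0.0:
        return 0.0
    return dot / (norm_h * norm_a)


def needs_clarification(rho: float, threshold: float = 0.8) -> bool:
    """Hypothetical rule: ask for clarification when alignment is low.

    The threshold value is an illustrative assumption, not a figure
    taken from the paper.
    """
    return rho < threshold


if __name__ == "__main__":
    # Made-up two-dimensional beliefs, for illustration only.
    human = [1.0, 0.0]
    agent = [1.0, 1.0]
    rho = cosine_alignment(human, agent)
    print(f"rho = {rho:.4f}, clarify = {needs_clarification(rho)}")
```

Under this reading, a score near 1 means the two beliefs point in nearly the same direction in the latent space, while a low score could serve as one possible trigger for the clarification dialogue shown in Figure 1(b).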