Episodic Memory Verbalization using Hierarchical Representations of Life-Long Robot Experience

Leonard Bärmann, Chad DeChant, Joana Plewnia, Fabian Peller-Konrad, Daniel Bauer, Tamim Asfour, Alex Waibel

TL;DR

The paper addresses verbalizing and querying life-long robotic experiences by introducing a hierarchical episodic memory (EM) structure and using a large language model as an interactive agent to explore and summarize past events. The method constructs a multi-level memory tree (L0–L4+), where higher levels are generated by recursive LLM prompts, and accesses the memory via an agent that expands relevant nodes and calls tools to minimize token costs. Extensive evaluations on TEACh simulations, Ego4D egocentric videos, and real-world Armar-7 data demonstrate scalable, token-efficient QA over long histories, with hierarchical summaries providing robustness to noise and data volume. The work highlights significant potential for natural human-robot interaction while outlining limitations related to retrieval accuracy and error propagation across components, guiding future improvements in multimodal, personalized lifelong memory systems.

Abstract

Verbalization of robot experience, i.e., summarization of and question answering about a robot's past, is a crucial ability for improving human-robot interaction. Previous works applied rule-based systems or fine-tuned deep models to verbalize short (several-minute-long) streams of episodic data, limiting generalization and transferability. In our work, we apply large pretrained models to tackle this task with zero or few examples, and specifically focus on verbalizing life-long experiences. For this, we derive a tree-like data structure from episodic memory (EM), with lower levels representing raw perception and proprioception data, and higher levels abstracting events to natural language concepts. Given such a hierarchical representation built from the experience stream, we apply a large language model as an agent to interactively search the EM given a user's query, dynamically expanding (initially collapsed) tree nodes to find the relevant information. The approach keeps computational costs low even when scaling to months of robot experience data. We evaluate our method on simulated household robot data, human egocentric videos, and real-world robot recordings, demonstrating its flexibility and scalability.
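The tree-plus-agent idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `HistoryNode`, the functions `render` and `search`, and the example summaries are all hypothetical, and a simple keyword match stands in for the LLM's decision about which collapsed node to expand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HistoryNode:
    """One node of a hypothetical history tree (all names are illustrative)."""
    level: int                      # 0 = raw data, higher = more abstract
    summary: str                    # natural-language description of this span
    children: List["HistoryNode"] = field(default_factory=list)
    expanded: bool = False          # every node starts collapsed


def render(node: HistoryNode, indent: int = 0) -> List[str]:
    """Return the lines an agent would see; children appear only if expanded."""
    lines = [" " * indent + f"[L{node.level}] {node.summary}"]
    if node.expanded:
        for child in node.children:
            lines.extend(render(child, indent + 2))
    return lines


def search(root: HistoryNode, keyword: str) -> Optional[HistoryNode]:
    """Greedy top-down search. ASSUMPTION: a keyword match replaces the LLM."""
    node = root
    while node.children:
        node.expanded = True        # expand only the node being explored
        matches = [c for c in node.children if keyword in c.summary.lower()]
        if not matches:
            return None
        node = matches[-1]          # prefer the most recent matching child
    return node


# Made-up example data, not taken from the paper's datasets.
morning = HistoryNode(2, "Morning: put milk in the fridge", [
    HistoryNode(1, "grasp milk"),
    HistoryNode(1, "place milk in fridge"),
])
afternoon = HistoryNode(2, "Afternoon: washed dishes", [
    HistoryNode(1, "grasp sponge"),
    HistoryNode(1, "scrub plate"),
])
root = HistoryNode(3, "Tuesday: kitchen tasks with milk and dishes",
                   [morning, afternoon])

leaf = search(root, "milk")
visible = render(root)
```

After the search, only the branch relevant to the query is expanded, so the text shown to the agent grows with the depth of the explored path rather than with the total size of the history. That is the intuition behind the low token cost claimed above; the actual system additionally lets the LLM call tools such as index search or a VLM.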

Paper Structure (11 sections, 7 figures, 2 tables)

This paper contains 11 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Our system answers queries about life-long experience of an agent (human or robotic) by exploring a tree representation of episodic memory.
  • Figure 2: From the continuous, multimodal stream of robotic experiences, we construct a history tree, a hierarchical representation of the EM.
  • Figure 3: To answer a user's question, H-EMV prompts an LLM to interactively explore the history tree containing the agent's experiences. The LLM can further invoke tools (index search, VLM) or perform other calculations to gather relevant information, eventually invoking the answer function. This figure shows an example from our real-world evaluation on the humanoid robot Armar-7, modified for illustrative purposes.
  • Figure 4: Token costs vs. performance for different history lengths (TEACh full multimodal). Solid lines are $S_p$, dashed lines $S_c$. H-EMV retains better performance at lower costs.
  • Figure 5: Example traces from our Armar-7 evaluation. Left: success case. Middle: partially correct, LLM misinterpreting the question. Right (shortened): wrong answer, expanded the wrong node (not the last time grasping the milk). More examples at https://hierarchical-emv.github.io
  • ...and 2 more figures