Episodic Memory Verbalization using Hierarchical Representations of Life-Long Robot Experience
Leonard Bärmann, Chad DeChant, Joana Plewnia, Fabian Peller-Konrad, Daniel Bauer, Tamim Asfour, Alex Waibel
TL;DR
The paper addresses verbalizing and querying life-long robot experience. It introduces a hierarchical episodic memory (EM) and uses a large language model (LLM) as an interactive agent that explores and summarizes past events. The method builds a multi-level memory tree (L0–L4+) whose higher levels are generated by recursive LLM prompts; at query time, the agent expands only the relevant nodes and calls tools, which keeps token costs low. Evaluations on TEACh simulations, Ego4D egocentric videos, and real-world ARMAR-7 data show scalable, token-efficient question answering over long histories, with the hierarchical summaries providing robustness to noise and data volume. The authors see potential for natural human-robot interaction and name retrieval accuracy and error propagation across components as limitations, pointing to multimodal, personalized lifelong memory systems as directions for future improvement.
Abstract
Verbalization of robot experience, i.e., summarization of and question answering about a robot's past, is a crucial ability for improving human-robot interaction. Previous works applied rule-based systems or fine-tuned deep models to verbalize short (several-minute-long) streams of episodic data, limiting generalization and transferability. In our work, we apply large pretrained models to tackle this task with zero or few examples, and specifically focus on verbalizing life-long experiences. For this, we derive a tree-like data structure from episodic memory (EM), with lower levels representing raw perception and proprioception data, and higher levels abstracting events to natural language concepts. Given such a hierarchical representation built from the experience stream, we apply a large language model as an agent to interactively search the EM given a user's query, dynamically expanding (initially collapsed) tree nodes to find the relevant information. The approach keeps computational costs low even when scaling to months of robot experience data. We evaluate our method on simulated household robot data, human egocentric videos, and real-world robot recordings, demonstrating its flexibility and scalability.
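The tree-like memory and the query-driven expansion described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `MemoryNode`, `summarize`, `build_tree`, `relevance`, and `search` are hypothetical, a fixed fan-out stands in for however the authors group events, string joining stands in for LLM summarization, and keyword overlap stands in for the LLM agent's choice of which node to expand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    """One node of a hypothetical hierarchical episodic memory."""
    level: int                      # 0 = leaf event, higher = more abstract
    summary: str                    # natural-language description
    children: List["MemoryNode"] = field(default_factory=list)
    expanded: bool = False          # nodes start collapsed


def summarize(children: List[MemoryNode]) -> str:
    # Assumption: stand-in for an LLM summarization prompt.
    return " | ".join(c.summary for c in children)


def build_tree(leaves: List[MemoryNode], fanout: int = 2) -> MemoryNode:
    """Group nodes bottom-up until a single root remains."""
    if not leaves:
        raise ValueError("need at least one leaf")
    nodes = list(leaves)
    while len(nodes) > 1:
        level = nodes[0].level + 1
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [MemoryNode(level, summarize(g), g) for g in groups]
    return nodes[0]


def relevance(query: str, node: MemoryNode) -> int:
    # Assumption: keyword overlap replaces the LLM's relevance judgment.
    query_words = set(query.lower().split())
    node_words = set(node.summary.lower().replace("|", " ").split())
    return len(query_words & node_words)


def search(root: MemoryNode, query: str) -> MemoryNode:
    """Expand only the most relevant child at each level, top-down."""
    node = root
    while node.children:
        node.expanded = True
        node = max(node.children, key=lambda c: relevance(query, c))
    return node


leaves = [
    MemoryNode(0, "picked up cup in kitchen"),
    MemoryNode(0, "placed cup on table"),
    MemoryNode(0, "opened fridge"),
    MemoryNode(0, "grasped milk carton"),
]
root = build_tree(leaves)
hit = search(root, "where is the milk")
print(hit.summary)
```

Because only one branch is expanded per level, the number of summaries read for a query grows with the depth of the tree rather than with the total number of recorded events, which is the intuition behind the low cost claimed for long histories.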
