ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation

Abrar Anwar, John Welsh, Joydeep Biswas, Soha Pouya, Yan Chang

TL;DR

The paper tackles long-horizon memory and reasoning for robot navigation by introducing ReMEmbR, a retrieval-augmented memory system that builds a queryable memory from video captions and spatial-temporal data. It formulates the problem as memory building plus a Python-like LLM-driven querying loop that retrieves a compact history to answer questions and generate navigational goals. The NaVQA dataset provides a benchmark for spatial, temporal, and descriptive questions over extended robot histories, with results showing strong long-horizon reasoning and lower latency than baselines, including VLM-based approaches. Real-world deployment validates practicality while highlighting captioning fidelity as an area for improvement. Overall, the work offers a scalable framework for long-horizon perception and queryable memory in embodied agents.

Abstract

Navigating and understanding complex environments over extended periods of time is a significant challenge for robots. People interacting with the robot may want to ask questions like where something happened, when it occurred, or how long ago it took place, which would require the robot to reason over a long history of its deployment. To address this problem, we introduce a Retrieval-augmented Memory for Embodied Robots, or ReMEmbR, a system designed for long-horizon video question answering for robot navigation. To evaluate ReMEmbR, we introduce the NaVQA dataset, where we annotate long-horizon robot navigation videos with spatial, temporal, and descriptive questions. ReMEmbR employs a structured approach involving a memory-building phase and a querying phase, leveraging temporal information, spatial information, and images to efficiently handle continuously growing robot histories. Our experiments demonstrate that ReMEmbR outperforms LLM and VLM baselines, allowing ReMEmbR to achieve effective long-horizon reasoning with low latency. Additionally, we deploy ReMEmbR on a robot and show that our approach can handle diverse queries. The dataset, code, videos, and other material can be found at the following link: https://nvidia-ai-iot.github.io/remembr
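
As a rough illustration of the memory-building and querying phases, the following minimal Python sketch stores captions together with a position and a timestamp, then retrieves entries by text similarity, position, or time, following the description in the Figure 2 caption. Everything here is an assumption made for illustration: `ToyMemory`, `embed`, the hashed bag-of-words embedding, and the in-memory list stand in for the captioning model, text embedder, and vector database, none of which are specified in this summary. The LLM querying loop is not modeled.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import List, Tuple

DIM = 256  # toy embedding size (assumption, not from the paper)


def embed(text: str) -> List[float]:
    """Toy hashed bag-of-words embedding; a stand-in for a real text embedder."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class MemoryItem:
    caption: str                   # text that a captioning model would produce
    embedding: List[float]         # embedding of the caption
    position: Tuple[float, float]  # robot (x, y) when the caption was made
    time: float                    # seconds since the start of deployment


class ToyMemory:
    """In-memory stand-in for a vector database (hypothetical helper)."""

    def __init__(self) -> None:
        self.items: List[MemoryItem] = []

    def add(self, caption: str, position: Tuple[float, float], time: float) -> None:
        # Memory building: embed the caption and store it with position and time.
        self.items.append(MemoryItem(caption, embed(caption), position, time))

    def by_text(self, query: str, k: int = 1) -> List[MemoryItem]:
        # Rank by cosine similarity (vectors are already unit length).
        q = embed(query)
        return sorted(
            self.items,
            key=lambda it: sum(a * b for a, b in zip(q, it.embedding)),
            reverse=True,
        )[:k]

    def by_position(self, xy: Tuple[float, float], k: int = 1) -> List[MemoryItem]:
        # Rank by Euclidean distance to the queried position.
        return sorted(self.items, key=lambda it: math.dist(xy, it.position))[:k]

    def by_time(self, t: float, k: int = 1) -> List[MemoryItem]:
        # Rank by absolute difference to the queried time.
        return sorted(self.items, key=lambda it: abs(it.time - t))[:k]


memory = ToyMemory()
memory.add("a soda machine next to the elevators", (2.0, 5.0), 60.0)
memory.add("a large window with a view of trees", (10.0, 1.0), 300.0)
memory.add("people sitting on a couch in the lobby", (4.0, 8.0), 540.0)

# A spatial question reduces to a text lookup followed by reading the position.
hit = memory.by_text("where is the soda machine")[0]
print(hit.caption, hit.position, hit.time)
```

In a system like the one described, an LLM would decide which of these lookups to call and when it has gathered enough context to answer; the sketch only shows the storage layout and the three kinds of retrieval.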

Paper Structure (9 sections, 3 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Robots continuously operate for long periods of time, where they gather long histories. In this work, we investigate how to aggregate these robot histories over time efficiently, and how to utilize that memory representation for answering spatio-temporal questions and generating navigational goals.
  • Figure 2: (Left) We design ReMEmbR with a memory building phase and a querying phase. The memory building phase runs a VILA [lin2024vila] video captioning model, embeds the caption, then stores the caption embedding, position, and time vectors in a vector database. When a user asks a question, an LLM starts a querying loop over the vector database. (Right) We evaluate ReMEmbR on the NaVQA dataset, which we construct. NaVQA consists of three types of questions, as shown above. We then deploy ReMEmbR on a robot.
  • Figure 3: We introduce the NaVQA dataset, which is composed of $210$ examples across three different time ranges up to 20 minutes in length. The dataset consists of spatial, temporal, and descriptive questions, each of which has different types of outputs as shown above.
  • Figure 4: Overall correctness over time. We discretize time into 4 bins and average overall correctness scores in each. Note that although the Medium category in Table \ref{tab:correctness} is incomplete, some test instances did complete. We find that ReMEmbR is more correct as the amount of time increases.
  • Figure 5: Robot deployment. We deploy ReMEmbR on a Nova Carter robot. We run the memory building phase for 25 minutes, and then begin to ask navigation-centric questions. The robot successfully handles various instructions, including more ambiguous ones such as going somewhere with a nice view. However, we found that ReMEmbR often confuses some objects, such as soda machines and water fountains, leading to incorrect goals.
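
The aggregation described in the Figure 4 caption (discretize time into 4 bins, then average correctness within each bin) can be sketched as follows. This is a minimal sketch under stated assumptions: the equal-width binning, the `bin_average` helper, and the sample values are illustrative only and are not taken from the paper or its results.

```python
from typing import List, Optional


def bin_average(times: List[float], scores: List[float],
                num_bins: int = 4) -> List[Optional[float]]:
    """Average scores inside equal-width time bins (binning rule is an assumption)."""
    lo, hi = min(times), max(times)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all samples share one timestamp; everything lands in bin 0
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for t, s in zip(times, scores):
        # Clamp so the largest timestamp falls into the last bin.
        idx = min(int((t - lo) / width), num_bins - 1)
        sums[idx] += s
        counts[idx] += 1
    # Empty bins are reported as None rather than 0 to avoid implying a score.
    return [sums[i] / counts[i] if counts[i] else None for i in range(num_bins)]


# Made-up toy values: video length in minutes and a 0/1 correctness score.
toy_minutes = [1, 3, 6, 9, 12, 15, 18, 20]
toy_correct = [1, 0, 1, 1, 0, 1, 1, 1]
print(bin_average(toy_minutes, toy_correct))
```

Reporting empty bins as `None` keeps a bin with no completed test instances from being read as a bin with zero correctness.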