STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory

Mingfeng Yuan; Hao Zhang; Mahan Mohammadi; Runhao Li; Jinjun Shan; Steven L. Waslander

TL;DR

STaR tackles long-horizon, open-world robot memory by building OmniMem, a multimodal memory $M$ composed of captions $\mathcal{C}_{1:T}$, 3D primitives $\mathcal{X}_{1:T}$, and keyframes $\mathcal{K}_{1:T}$, and by applying an Information Bottleneck-based Scalable Task-Conditioned Retrieval that distills a compact set of candidate memories $R$ while preserving the answer distribution $p(A|Q,M)$ for a query $Q$. An agentic RAG workflow reasons over multimodal input, plans search strategies, and grounds answers in the retrieved memories to support navigation and manipulation. On NaVQA and WH-VQA, and in deployments on a real Husky robot, STaR achieves higher success rates and lower spatial error than the baselines, which supports its scalability and practical utility for long-horizon robot memory in dynamic environments.

Abstract

Mobile robots are often deployed over long durations in diverse open, dynamic scenes, including indoor settings such as warehouses and manufacturing facilities, and outdoor settings such as agricultural and roadway operations. A core challenge is to build a scalable long-horizon memory that supports an agentic workflow for planning, retrieval, and reasoning over open-ended instructions at variable granularity, while producing precise, actionable answers for navigation. We present STaR, an agentic reasoning framework that (i) constructs a task-agnostic, multimodal long-term memory that generalizes to unseen queries while preserving fine-grained environmental semantics (object attributes, spatial relations, and dynamic events), and (ii) introduces a Scalable Task-Conditioned Retrieval algorithm based on the Information Bottleneck principle to extract from long-term memory a compact, non-redundant, information-rich set of candidate memories for contextual reasoning. We evaluate STaR on NaVQA (mixed indoor/outdoor campus scenes) and WH-VQA, a customized warehouse benchmark built with Isaac Sim that contains many visually similar objects and emphasizes contextual reasoning. Across the two datasets, STaR consistently outperforms strong baselines, achieving higher success rates and markedly lower spatial error. We further deploy STaR on a real Husky wheeled robot in both indoor and outdoor environments, demonstrating robust long-horizon reasoning, scalability, and practical utility.
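
For orientation, the retrieval goal can be written in the form of the classical Information Bottleneck Lagrangian. This is only a sketch in the notation of the TL;DR ($M$ for the long-term memory, $Q$ for the query, $A$ for the answer, $R$ for the retrieved set); the trade-off weight $\beta$ and the conditioning on $Q$ are assumptions made here for illustration, and the paper's own objective may differ in form:

$$
\min_{p(R \mid M, Q)} \; I(M; R \mid Q) \;-\; \beta \, I(R; A \mid Q)
$$

The first term rewards compression of the memory into $R$, and the second rewards keeping what is informative about the answer, so that ideally $p(A \mid Q, R) \approx p(A \mid Q, M)$.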

Paper Structure

This paper contains 15 sections, 11 equations, 6 figures, and 2 tables.

INTRODUCTION
Related works
Problem Formulation
Methodology
Building OmniMem
Scalable Task-Conditioned Retrieval
Experimental Setup
NaVQA dataset
WH-VQA dataset
Evaluation Metrics
Results
NaVQA results
WH-VQA dataset
Implementation Details and On-Device Deployment
CONCLUSION

Figures (6)

Figure 1: STaR System Overview. Our framework consists of three stages. (Left) Memory construction: the robot records RGB and posed depth data to build a multimodal memory composed of three complementary databases (DB) -- video caption, 3D primitive, and visual keyframe -- jointly forming OmniMem. (Middle) User query and reasoning: given text or multimodal queries, an agentic planner (MLLM) retrieves task-relevant memories through an Information Bottleneck, performs contextual reasoning, and outputs structured answers (location, time, or description). (Right) Evaluation: we evaluate STaR on both the NaVQA dataset (campus) and the WH-VQA dataset (warehouse), which cover spatial, temporal, and descriptive question types across short-, medium-, and long-term memory settings. The evaluation examines three key capabilities: long-horizon cross-modal memory construction, task-conditioned memory retrieval, and contextual reasoning. We also validate multimodal query and navigation tasks in a warehouse simulated with Isaac Sim.
Figure 2: Task-conditioned retrieval and contextual reasoning. Given an open-ended query, we embed its cues and query the DB to retrieve above-threshold video captions with timestamps and detected objects (Caption–Induced Primitive). These captions induce a working set of primitives $\mathcal{X}'_Q$, on which we run IB to merge neighboring primitives into compact, task-relevant clusters. We then group captions by cluster and select one representative caption per cluster to form a non-redundant evidence set. From these memories, the robot optionally loads keyframe images to resolve fine-grained details, performs contextual reasoning, and outputs actionable answers (e.g., locations of white “Digital Twin” boxes in the staging area and shelf indices with remaining pallet slots) that support navigation and Q&A.
Figure 3: Qualitative example. Left: 3D reconstruction for a 16-min memory horizon with the robot’s current pose (red) and three candidate answers. Task-relevant 3D primitives selected by our IB-based clustering are highlighted; task-irrelevant 3D primitives are masked (not shown). Right: temporal retrieval. STaR selects diverse, non-redundant captions and correctly grounds the “yellow pole with POLICE HELP sign,” choosing Candidate 2 as the nearest police pole. In contrast, ReMEmbR (bottom) repeatedly retrieves redundant captions at timestep 13:11 and fails to identify the correct nearest target.
Figure 4: Scalability of STaR on the NaVQA dataset with increasing memory length (from 36 seconds to 35.9 minutes): (a) Overall success rate; (b) Runtime breakdown of STaR.
Figure 5: Qualitative example. Left: IB-based clustering highlights task-relevant 3D primitives; cues “Digital Twin box” / “white box” guide retrieval, with irrelevant regions masked. Right: our cluster-wise grouping + keyframe selection retrieves diverse, non-redundant memories and correctly grounds Memory 4 (shelf area). ReMEmbR over-samples top-6 cosine hits near 13:43 (Memory 5).
...and 1 more figure
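
The retrieval steps described in the Figure 2 caption (threshold the captions by similarity to the query, merge neighboring primitives into clusters, keep one representative caption per cluster) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the `Caption` record, the `retrieve` function, the toy 2D embeddings, and both thresholds are illustration only. In particular, the greedy distance-based merge is a simple stand-in for the paper's IB-based clustering, not an implementation of it.

```python
import math
from dataclasses import dataclass


@dataclass
class Caption:
    """Hypothetical memory record: one video caption tied to a 3D primitive."""
    text: str
    timestamp: float   # seconds since the start of the recording
    embedding: tuple   # toy caption embedding (low-dimensional for illustration)
    position: tuple    # (x, y) location of the primitive the caption refers to


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_emb, captions, sim_threshold=0.5, merge_radius=2.0):
    """Return one representative caption per spatial cluster of query hits."""
    # Step 1: keep only captions whose similarity to the query clears the threshold.
    hits = [(cosine(query_emb, c.embedding), c) for c in captions]
    hits = [(s, c) for s, c in hits if s >= sim_threshold]

    # Step 2: greedily merge hits that lie close to an existing cluster's anchor.
    # ASSUMPTION: this distance rule replaces the IB-based merging in the paper.
    clusters = []
    for s, c in sorted(hits, key=lambda sc: -sc[0]):
        for cluster in clusters:
            anchor = cluster[0][1]
            if math.dist(anchor.position, c.position) <= merge_radius:
                cluster.append((s, c))
                break
        else:
            clusters.append([(s, c)])

    # Step 3: hits were visited in descending similarity, so the first member
    # of each cluster is its highest-scoring caption; use it as representative.
    return [cluster[0][1] for cluster in clusters]


memory = [
    Caption("white box on shelf A", 10.0, (1.0, 0.0), (0.0, 0.0)),
    Caption("white box on shelf A, second pass", 12.0, (0.9, 0.1), (0.5, 0.0)),
    Caption("white box in staging area", 300.0, (0.8, 0.2), (10.0, 5.0)),
    Caption("forklift passing by", 50.0, (0.0, 1.0), (3.0, 3.0)),
]

for cap in retrieve((1.0, 0.0), memory):
    print(cap.timestamp, cap.text)
```

With this toy memory, the two near-duplicate shelf captions collapse into a single representative and the unrelated forklift caption is filtered out, which is the non-redundancy property the caption attributes to cluster-wise selection. A plain top-k cosine search over the same data would instead return both shelf captions.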
