STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory
Mingfeng Yuan, Hao Zhang, Mahan Mohammadi, Runhao Li, Jinjun Shan, Steven L. Waslander
TL;DR
STaR tackles long-horizon open-world robotic memory by building OmniMem, a multimodal memory with captions $\mathcal{C}_{1:T}$, 3D primitives $\mathcal{X}_{1:T}$, and keyframes $\mathcal{K}_{1:T}$, and applying an Information Bottleneck-based Scalable Task-Conditioned Retrieval to distill a compact $R$ that preserves $p(A|Q,M)$. An agentic RAG workflow reasons over multimodal input, plans search strategies, and grounds answers in retrieved memories to support navigation and manipulation. Empirical results on NaVQA and WH-VQA, plus real Husky deployments, show STaR achieves higher success rates and lower spatial error than baselines, confirming scalability and practical utility in long-horizon robot memory. The approach enables robust, context-aware memory retrieval and reasoning for open-ended tasks, with promising implications for autonomous navigation and manipulation in dynamic environments.
Abstract
Mobile robots are often deployed over long durations in diverse open, dynamic scenes, including indoor settings such as warehouses and manufacturing facilities, and outdoor settings such as agricultural and roadway operations. A core challenge is to build a scalable long-horizon memory that supports an agentic workflow for planning, retrieval, and reasoning over open-ended instructions at variable granularity, while producing precise, actionable answers for navigation. We present STaR, an agentic reasoning framework that (i) constructs a task-agnostic, multimodal long-term memory that generalizes to unseen queries while preserving fine-grained environmental semantics (object attributes, spatial relations, and dynamic events), and (ii) introduces a Scalable Task-Conditioned Retrieval algorithm based on the Information Bottleneck principle to extract from long-term memory a compact, non-redundant, information-rich set of candidate memories for contextual reasoning. We evaluate STaR on NaVQA (mixed indoor/outdoor campus scenes) and WH-VQA, a custom warehouse benchmark built with Isaac Sim that contains many visually similar objects and emphasizes contextual reasoning. Across the two datasets, STaR consistently outperforms strong baselines, achieving higher success rates and markedly lower spatial error. We further deploy STaR on a real Husky wheeled robot in both indoor and outdoor environments, demonstrating robust long-horizon reasoning, scalability, and practical utility.
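To make the idea of extracting a compact, non-redundant, information-rich set of candidate memories more concrete, the following is a minimal illustrative sketch and not the STaR algorithm itself. It swaps in a simple greedy, redundancy-penalized selection as a rough proxy for an Information Bottleneck trade-off between relevance to the query and compression of the memory. The function names `cosine` and `select_memories`, the embedding vectors, the `budget`, and the trade-off weight `beta` are all assumptions introduced here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_memories(query_vec, memory_vecs, budget, beta=0.5):
    """Greedily pick up to `budget` memory indices (hypothetical helper).

    Each step scores a candidate as
        relevance to the query - beta * max similarity to already-selected items,
    so near-duplicate memories are penalized. This is an assumed stand-in for
    an Information Bottleneck style objective, not the paper's formulation.
    """
    selected = []
    remaining = list(range(len(memory_vecs)))
    while remaining and len(selected) < budget:

        def score(i):
            relevance = cosine(query_vec, memory_vecs[i])
            redundancy = max(
                (cosine(memory_vecs[i], memory_vecs[j]) for j in selected),
                default=0.0,
            )
            return relevance - beta * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy 2D embeddings: items 0 and 1 are near-duplicates, item 2 is distinct.
    query = [1.0, 1.0]
    memories = [[1.0, 0.22], [1.0, 0.25], [0.2, 1.0]]
    print(select_memories(query, memories, budget=2, beta=0.5))
    print(select_memories(query, memories, budget=2, beta=0.0))
```

With `beta=0.0` the selection reduces to plain top-k by relevance, which keeps both near-duplicate items in the toy example; a positive `beta` trades some relevance for coverage and picks the distinct item instead. In a real system the vectors would come from whatever encoder produces the caption, 3D primitive, or keyframe embeddings, which this sketch does not model.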
