Searching in Space and Time: Unified Memory-Action Loops for Open-World Object Retrieval

Taijing Chen, Sateesh Kumar, Junhong Xu, Georgios Pavlakos, Joydeep Biswas, Roberto Martín-Martín

TL;DR

The paper addresses open-world object retrieval in dynamic environments by unifying memory-driven recall and embodied search within a single decision loop. STAR uses a non-parametric long-term memory with semantic, temporal, and spatial indices, plus a working memory that is updated through spatial actions and memory queries, all steered by an LLM policy over a unified action space. The authors introduce STARBench to benchmark spatiotemporal object search and show that STAR consistently outperforms baselines on attribute, spatial, and temporal reasoning tasks, with successful real-robot transfer to a Tiago platform. This work advances practical retrieval in evolving surroundings by enabling joint reasoning about past and present states, with implications for robust service robotics and open-vocabulary understanding in dynamic domains.

Abstract

Service robots must retrieve objects in dynamic, open-world settings where requests may reference attributes ("the red mug"), spatial context ("the mug on the table"), or past states ("the mug that was here yesterday"). Existing approaches capture only parts of this problem: scene graphs capture spatial relations but ignore temporal grounding, temporal reasoning methods model dynamics but do not support embodied interaction, and dynamic scene graphs handle both but remain closed-world with fixed vocabularies. We present STAR (SpatioTemporal Active Retrieval), a framework that unifies memory queries and embodied actions within a single decision loop. STAR leverages non-parametric long-term memory and a working memory to support efficient recall, and uses a vision-language model to select either temporal or spatial actions at each step. We introduce STARBench, a benchmark of spatiotemporal object search tasks across simulated and real environments. Experiments in STARBench and on a Tiago robot show that STAR consistently outperforms scene-graph and memory-only baselines, demonstrating the benefits of treating search in time and search in space as a unified problem. For more information: https://amrl.cs.utexas.edu/STAR.
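As a rough illustration of what a non-parametric long-term memory with semantic, temporal, and spatial lookups could look like, here is a minimal sketch. It is not the authors' implementation: the names `Observation`, `LongTermMemory`, and `query` are hypothetical, and plain keyword matching stands in for whatever open-vocabulary semantic matching STAR actually uses.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    """One stored observation (hypothetical record layout)."""
    t: int                         # timestep at which the object was seen
    position: Tuple[float, float]  # 2D map coordinates, assumed for simplicity
    caption: str                   # free-text, open-vocabulary description


class LongTermMemory:
    """Stores raw observations and filters them at query time."""

    def __init__(self) -> None:
        self._records: List[Observation] = []

    def add(self, obs: Observation) -> None:
        self._records.append(obs)

    def query(
        self,
        text: Optional[str] = None,
        t_range: Optional[Tuple[int, int]] = None,
        near: Optional[Tuple[float, float]] = None,
        radius: float = 1.0,
    ) -> List[Observation]:
        """Return observations matching every given filter, oldest first."""
        hits = []
        for obs in self._records:
            # Semantic filter: keyword overlap as a stand-in (assumption).
            if text is not None:
                words = obs.caption.lower().split()
                if not all(w in words for w in text.lower().split()):
                    continue
            # Temporal filter: inclusive timestep window.
            if t_range is not None and not (t_range[0] <= obs.t <= t_range[1]):
                continue
            # Spatial filter: Euclidean distance to a query point.
            if near is not None and dist(obs.position, near) > radius:
                continue
            hits.append(obs)
        return sorted(hits, key=lambda o: o.t)


memory = LongTermMemory()
memory.add(Observation(t=5, position=(1.0, 2.0), caption="black folder on shelf"))
memory.add(Observation(t=22, position=(4.0, 1.0), caption="green folder on desk"))
memory.add(Observation(t=52, position=(4.2, 1.1), caption="red mug on desk"))

green = memory.query(text="green folder")
recent_near_desk = memory.query(t_range=(20, 60), near=(4.0, 1.0), radius=0.5)
```

Because the records are kept as-is and filtered only when queried, new object descriptions need no fixed vocabulary. A real system would likely replace the linear scan with dedicated indices.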

Paper Structure

This paper contains 11 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: A fundamental skill for a robot is to retrieve desired objects, often specified through a combination of spatial and temporal references (e.g., "bring the book that was on the desk yesterday"). STAR is a framework that integrates long-term memory with an agentic search engine, enabling the robot to decide whether to search in time (recall past observations) or in space (probe the current environment), producing action sequences (2 to 6) to obtain the necessary information to retrieve the correct object.
  • Figure 2: STAR system. The robot patrols dynamic environments over multiple days to build a non-parametric long-term memory of past observations (left). When the user requests “bring me the book that was on the study desk yesterday”, the agent initializes its working memory with the task. Guided by this working memory, STAR chooses actions from a unified space: recalling past observations (search in time) or probing the current environment through navigation, perception, and manipulation (search in space). Each outcome updates the working memory, and the loop continues until the robot successfully retrieves the target object.
  • Figure 3: Task families in STARBench. Each row shows one task family with its instruction (left), the agent’s prior observations (middle), and the correct target object (right). Bounding boxes indicate locations of interest, including target object positions. "Curr. Observation" denotes the current agent observation. In the class-based case, the folder seen at $t=52$ is the target object. In the attribute-based case, the instruction is “find the green folder”; although a black folder is visible at $t=5$, the correct target is the green folder first observed at $t=22$. In the spatial case, the toy observed on the bed at $t=74$ is the target. In the spatial-temporal case, the folder was seen one day earlier next to the grey chair at $t=8$ and was later moved; its new position at $t=115$ is the target. In the spatial-frequentist case, the target is the black folder that is most often found next to the grey chair but was last observed at $t=143$ in a new location.
  • Figure 4: Mock Apartment for real-world evaluations. We construct a mock apartment with a kitchen, a living room, and a study area.
  • Figure 5: Execution success rates across five task types of Visible Object Search tasks in STARBench (45 tasks per type). Bars indicate approaches; hatching denotes the environment knowledge used to construct long-term memory. Oracle builds memory with ground-truth object class labels; Realistic builds memory from model predictions only. SG+S uses full scene-graph history for a one-shot attempt; TR+S queries non-parametric memory for a one-shot attempt; STAR (ours) combines temporal retrieval with spatial search and achieves the highest success across task types.
  • ...and 1 more figure
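The loop described in the Figure 2 caption (initialize working memory with the task, pick one action from a unified space, fold the outcome back into working memory, repeat until retrieval) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: `run_loop`, `toy_policy`, and `toy_executors` are hypothetical names, the action set is reduced to four actions, and a hand-written rule stands in for the learned language-model policy.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (action name, argument)


def run_loop(
    task: str,
    policy: Callable[[List[str]], Action],
    executors: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> List[str]:
    """Run a unified memory-action loop and return the working memory."""
    working_memory = [f"task: {task}"]
    for _ in range(max_steps):
        # The policy sees the whole working memory and picks one action,
        # either a memory query ("recall") or a spatial action.
        name, arg = policy(working_memory)
        outcome = executors[name](arg)
        # Every outcome is written back so the next decision can use it.
        working_memory.append(f"{name}({arg}) -> {outcome}")
        if name == "pick" and outcome == "success":
            break
    return working_memory


def toy_policy(working_memory: List[str]) -> Action:
    """Rule-based stand-in for the language-model policy (assumption)."""
    done = [entry.split("(")[0] for entry in working_memory[1:]]
    if "recall" not in done:
        return ("recall", "book on study desk yesterday")  # search in time
    if "navigate" not in done:
        return ("navigate", "study desk")                  # search in space
    if "detect" not in done:
        return ("detect", "book")
    return ("pick", "book")


# Stub executors that always succeed; a robot would call real skills here.
toy_executors = {
    "recall": lambda query: "book last seen at study desk",
    "navigate": lambda place: "arrived",
    "detect": lambda label: "found",
    "pick": lambda label: "success",
}

trace = run_loop(
    "bring me the book that was on the study desk yesterday",
    toy_policy,
    toy_executors,
)
for entry in trace:
    print(entry)
```

Keeping memory queries and physical actions in one action space means a single policy call decides between recalling and probing, with no separate planner for each mode.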