Describe Anything Anywhere At Any Moment

Nicolas Gorlo, Lukas Schmid, Luca Carlone

TL;DR

DAAAM addresses the need for a memory system that is both geometrically grounded and semantically rich for large-scale, real-time 4D scene understanding. It introduces an optimization-based frontend that selects frames and annotates them in batches with a large captioning model, then reconciles the observations into a hierarchical 4D scene graph that LLM-based agents can access through a tool-calling interface. The approach yields state-of-the-art results on spatio-temporal QA and sequential task grounding, runs in real time at 10 Hz, and scales to long horizons and kilometers of travel. The authors release their dataset and code as open source, enabling broader adoption and benchmarking.

Abstract

Computer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language grounding and semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing. It leverages such semantic understanding to build a hierarchical 4D scene graph (SG), which acts as an effective memory representation that is globally consistent in space and time. DAAAM constructs 4D SGs with detailed, geometrically grounded descriptions while maintaining real-time performance. We show that DAAAM's 4D SG interfaces well with a tool-calling agent for inference and reasoning. We thoroughly evaluate DAAAM in the complex task of spatio-temporal question answering on the NaVQA benchmark and show its generalization capabilities for sequential task grounding on the SG3D benchmark. We further curate an extended OC-NaVQA benchmark for large-scale and long-time evaluations. DAAAM achieves state-of-the-art results in both tasks, improving OC-NaVQA question accuracy by 53.6%, position errors by 21.9%, temporal errors by 21.6%, and SG3D task grounding accuracy by 27.8% over the most competitive baselines, respectively. We release our data and code as open source.
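
To make the idea of a spatio-temporal memory queried by a tool-calling agent more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a flat list of object nodes rather than the hierarchical 4D SG described above, and every name in it (`ObjectNode`, `SceneGraphMemory`, `query`) is hypothetical. It only illustrates how a description, a 3D position, and an observation interval could be stored together and filtered by keyword and time window.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Hypothetical memory entry: what was seen, where, and when."""
    node_id: int
    description: str                       # free-text caption of the object
    position: tuple[float, float, float]   # metres, in an assumed map frame
    first_seen: float                      # seconds since start of the run
    last_seen: float                       # seconds since start of the run


@dataclass
class SceneGraphMemory:
    """Flat stand-in for a scene graph (assumption: no hierarchy, no regions)."""
    objects: list[ObjectNode] = field(default_factory=list)

    def add(self, node: ObjectNode) -> None:
        self.objects.append(node)

    def query(self, keyword: str,
              t_start: float = float("-inf"),
              t_end: float = float("inf")) -> list[ObjectNode]:
        """Return objects whose description contains `keyword` and whose
        observation interval overlaps [t_start, t_end]."""
        kw = keyword.lower()
        return [
            o for o in self.objects
            if kw in o.description.lower()
            and o.first_seen <= t_end
            and o.last_seen >= t_start
        ]


memory = SceneGraphMemory()
memory.add(ObjectNode(0, "red fire hydrant next to a bench", (12.0, 3.5, 0.0), 10.0, 14.0))
memory.add(ObjectNode(1, "blue bicycle leaning on a wall", (40.2, -1.0, 0.0), 95.0, 99.0))

# A tool call such as "where was a bicycle seen after t = 90 s?"
hits = memory.query("bicycle", t_start=90.0)
for hit in hits:
    print(hit.node_id, hit.position, hit.last_seen)
```

In this sketch an agent would expose `query` as one of its tools and read the position and timestamps from the returned nodes. Substring matching is a deliberate simplification; a real system would more likely use semantic retrieval over the descriptions.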

Paper Structure

This paper contains 13 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: We present Describe Anything, Anywhere, at Any Moment (DAAAM), a real-time, large-scale, spatio-temporal memory for embodied question answering and 4D reasoning. Given RGB-D sensor input, DAAAM incrementally constructs a hierarchical 4D scene graph with highly detailed annotations that acts as an effective and scalable spatio-temporal memory representation for LLM agents.
  • Figure 2: An overview of the proposed approach. Given an RGB-D video stream, we first segment the scene into fragments and track them over time in image space using a lightweight tracker [Aharon22arxiv-botsort]. We perform metric-semantic mapping using Hydra [Hughes24ijrr-hydraFoundations] with the Khronos [Schmid24rss-khronos] frontend on the unlabeled segments to build a 4D map of the environment. To semantically lift the resulting map, we aggregate the tracked observations in parallel and select frames using an optimization-based frame selection algorithm. The selected frames and segments are batch-processed by the Describe Anything Model (DAM) [Lian25arxiv-DAM] to generate detailed descriptions for each object. The generated descriptions are finally incorporated back into the map, and a 4D scene graph is constructed and clustered into semantically informed regions.
  • Figure 3: Speedup of DAM [Lian25arxiv-DAM] inference via batching. The baseline (batch size = 1) is shown as a dashed red line, batch processing as a solid blue line.
  • Figure A.1: Illustration of the steps of our traversability place extraction algorithm. The different stages are shown in figures (a)-(d) for a top-down view of the robot moving on a street.
  • Figure C.2: Mock example of the frame selection heuristic.