Scaling 3D Reasoning with LMMs to Large Robot Mission Environments Using Datagraphs

W. J. Meijer, A. C. Kemmeren, E. H. J. Riemens, J. E. Fransman, M. van Bekkum, G. J. Burghouts, J. D. van Mil

TL;DR

The paper tackles scaling Large Multimodal Models to large 3D mission environments where fixed context windows hinder global reasoning, and introduces a datagraph $G=(V,E)$ with nodes $v=(x_v,s_v)$ and edges $e_{ij}=(v_i,v_j)$ to support iterative LMM prompting. It presents two traversal strategies: a proximity-based approach and a path-based querying approach, enabling targeted reasoning over small, local regions rather than the entire environment. The contributions include a formal datagraph model, two prioritized local querying algorithms, and modality-agnostic data storage (e.g., 3D scenes, Gaussian splats, or images). The approach promises scalable reasoning for time-critical robotic missions by reducing context requirements while maintaining rich environmental information.
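The datagraph definition above can be sketched as a small data structure. This is a minimal illustration and not the authors' implementation: reading $x_v$ as a 3D position and $s_v$ as the local scene data stored at the node, treating edges as undirected, and the names `Node`, `Datagraph`, `add_node`, `add_edge`, and `neighbors` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple

Position = Tuple[float, float, float]


@dataclass
class Node:
    """One datagraph node v = (x_v, s_v)."""
    position: Position  # x_v, assumed here to be a 3D position
    scene: Any          # s_v, assumed here to be local scene data (any modality)


@dataclass
class Datagraph:
    """G = (V, E), stored as a node table plus an adjacency map."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    adjacency: Dict[str, Set[str]] = field(default_factory=dict)

    def add_node(self, node_id: str, position: Position, scene: Any) -> None:
        self.nodes[node_id] = Node(position, scene)
        self.adjacency.setdefault(node_id, set())

    def add_edge(self, i: str, j: str) -> None:
        # Assumption: edges are undirected, so e_ij is stored in both directions.
        for node_id in (i, j):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        self.adjacency[i].add(j)
        self.adjacency[j].add(i)

    def neighbors(self, node_id: str) -> List[str]:
        # Sorted so that traversal order is deterministic.
        return sorted(self.adjacency[node_id])


# Hypothetical two-room example; the scene payloads are placeholder strings.
graph = Datagraph()
graph.add_node("hall", (0.0, 0.0, 0.0), "scene-hall")
graph.add_node("lab", (4.0, 0.0, 0.0), "scene-lab")
graph.add_edge("hall", "lab")
```

Because the scene payload is typed as `Any`, the same structure could hold a mesh, a point cloud, a Gaussian splat, or an image, which is one way to read the modality-agnostic storage mentioned above.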

Abstract

This paper addresses the challenge of scaling Large Multimodal Models (LMMs) to expansive 3D environments. Solving this open problem is especially relevant for robot deployment in many first-responder scenarios, such as search-and-rescue missions that cover vast spaces. The use of LMMs in these settings is currently hampered by strict context windows that limit the LMM's input size. We therefore introduce a novel approach that utilizes a datagraph structure, which allows the LMM to iteratively query smaller sections of a large environment. Using the datagraph in conjunction with graph traversal algorithms, we can prioritize the locations most relevant to the query, thereby improving the scalability of 3D scene language tasks. We illustrate the datagraph using 3D scenes, but these can easily be replaced by other dense modalities that represent the environment, such as point clouds or Gaussian splats. We demonstrate the datagraph's potential on two 3D scene language tasks in an example search-and-rescue mission.

Paper Structure (7 sections, 4 figures, 2 algorithms)

Figures (4)

  • Figure 1: On the left is a continuous multi-room 3D scene. On the right, a graph (red) is extended with 3D scenes at each node to form a datagraph. Instead of processing the whole multi-room 3D scene at once, existing LMMs can use the smaller scenes in the datagraph to iteratively cover large areas.
  • Figure 2: There is a variety of 3D tasks we would like to perform with 3D-LMMs, as illustrated in the 3D-LLM work of Hong et al. (2023). We investigate how to perform such tasks over expansive environments with existing LMMs, which are limited in their context size.
  • Figure 3: Illustration of the iterative, spatially grounded LMM querying algorithm. The search starts at the agent's current node (blue) and gradually expands outward, retrieving perception scenes for the LMM to query.
  • Figure 4: Illustration of 3D scene language tasks along a navigation path. The purple node is the navigation goal and the blue node is where the agent is located. Two routes A and B are possible, highlighted with blue edges in the two figures on the right.
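The two querying patterns described in the captions of Figures 3 and 4 can be sketched as follows. This is a hedged illustration, not the paper's algorithms: breadth-first expansion stands in for the proximity-based ordering, `ask_lmm` is a hypothetical stub in place of a real LMM call, and the adjacency map, scene strings, and `max_queries` budget are invented for the example.

```python
from collections import deque
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Adjacency = Dict[str, List[str]]


def expand_outward(adjacency: Adjacency, start: str) -> Iterator[str]:
    """Yield node ids in breadth-first order, beginning at the agent's node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        yield node
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)


def query_nodes(
    order: Iterable[str],
    scenes: Dict[str, Any],
    ask_lmm: Callable[[Any, str], Any],
    question: str,
    max_queries: int,
) -> List[Tuple[str, Any]]:
    """Query one local scene per node, in the given order, up to a budget."""
    return [
        (node, ask_lmm(scenes[node], question))
        for node in islice(order, max_queries)
    ]


# Hypothetical environment: two routes from "a" to "d", via "b" or via "c".
adjacency = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
scenes = {
    "a": "hallway",
    "b": "room with debris",
    "c": "clear corridor",
    "d": "exit",
}


def ask_lmm(scene: Any, question: str) -> bool:
    # Stub standing in for an LMM call; it only checks the placeholder text.
    return "debris" in scene


question = "Is there debris here?"

# Proximity-based: expand outward from the agent at "a", with a budget of 3.
nearby = query_nodes(expand_outward(adjacency, "a"), scenes, ask_lmm, question, 3)

# Path-based: query only the nodes along each candidate route.
route_a = query_nodes(["a", "b", "d"], scenes, ask_lmm, question, 3)
route_b = query_nodes(["a", "c", "d"], scenes, ask_lmm, question, 3)
```

In both cases each LMM call receives a single node's scene, so the per-query context stays small no matter how large the whole environment is. A real system would likely rank nodes by something richer than hop count, such as metric distance or relevance to the query.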