
MA3DSG: Multi-Agent 3D Scene Graph Generation for Large-Scale Indoor Environments

Yirum Kim, Jaewoo Kim, Ue-Hwan Kim

TL;DR

This work tackles the scalability gap in 3D scene graph generation by introducing MA3DSG, a training-free, multi-agent framework that incrementally builds local graphs and fuses them into a global 3D semantic scene graph through a lightweight graph alignment and update mechanism. It also presents MA3DSG-Bench, a benchmark that simulates diverse agent configurations, domain sizes, and dynamic conditions to assess performance and scalability in large-scale indoor environments. Empirical results show MA3DSG achieving competitive accuracy while running up to 4x faster than a single-agent baseline and using up to ~98x less data traffic than multi-agent baselines, especially in dynamic LDCP scenarios. The work lays a foundation for scalable, real-world multi-agent 3DSGG systems and provides a practical benchmark for future research.

Abstract

Current 3D scene graph generation (3DSGG) approaches rely heavily on a single-agent assumption and small-scale environments, and therefore scale poorly to real-world scenarios. In this work, we introduce the Multi-Agent 3D Scene Graph Generation (MA3DSG) model, the first framework designed to tackle this scalability challenge using multiple agents. We develop a training-free graph alignment algorithm that efficiently merges partial query graphs from individual agents into a unified global scene graph. Leveraging extensive analysis and empirical insights, our approach enables conventional single-agent systems to operate collaboratively without requiring any learnable parameters. To rigorously evaluate 3DSGG performance, we propose MA3DSG-Bench, a benchmark that supports diverse agent configurations, domain sizes, and environmental conditions, providing a more general and extensible evaluation framework. This work lays a solid foundation for scalable, multi-agent 3DSGG research.
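
The abstract does not spell out the alignment algorithm, so the following is only a toy sketch of what merging a partial graph into a global scene graph could look like. It assumes, purely for illustration, that each node carries a class label and a 3D centroid and that nodes are matched by identical label and centroid distance; the names `SceneGraph`, `merge_local_into_global`, and the threshold `max_dist` are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # node id -> (class label, 3D centroid); an assumed, simplified node format
    nodes: dict = field(default_factory=dict)
    # (subject id, object id) -> predicate label
    edges: dict = field(default_factory=dict)


def merge_local_into_global(global_g, local_g, max_dist=0.5):
    """Toy, training-free merge (NOT the paper's algorithm).

    A local node is matched to the nearest unclaimed global node with the
    same label within max_dist (a hypothetical threshold in metres);
    otherwise it is added as a new global node.
    Returns the mapping from local node ids to global node ids.
    """
    mapping = {}
    for lid, (label, centroid) in local_g.nodes.items():
        best_id, best_d = None, max_dist
        for gid, (g_label, g_centroid) in global_g.nodes.items():
            if g_label != label or gid in mapping.values():
                continue
            d = math.dist(centroid, g_centroid)
            if d <= best_d:
                best_id, best_d = gid, d
        if best_id is None:
            # No match: register the object under a fresh global id.
            best_id = max(global_g.nodes, default=-1) + 1
            global_g.nodes[best_id] = (label, centroid)
        mapping[lid] = best_id

    # Re-index local edges into global ids; the newest observation overwrites.
    for (s, o), predicate in local_g.edges.items():
        global_g.edges[(mapping[s], mapping[o])] = predicate
    return mapping


global_graph = SceneGraph(
    nodes={0: ("chair", (0.0, 0.0, 0.0)), 1: ("table", (1.0, 0.0, 0.0))},
    edges={(0, 1): "next to"},
)
local_graph = SceneGraph(
    nodes={0: ("table", (1.1, 0.0, 0.0)), 1: ("lamp", (1.1, 0.0, 0.8))},
    edges={(1, 0): "standing on"},
)
mapping = merge_local_into_global(global_graph, local_graph)
```

In this example the agent's table is matched to the existing global table, while the lamp is new and receives a fresh global id. Only the local graph (labels, centroids, edges) needs to be exchanged in such a scheme, which gives some intuition for why graph-level fusion can keep data traffic low.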

Paper Structure

This paper contains 34 sections, 2 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of Runtime and Data Traffic. Our MA3DSG (14.8 min, 3.7 MB) runs $4\times$ faster than the single-agent system (SGFN, 61.8 min) and uses $98\times$ less data traffic than the multi-agent system (SGFN + SG-PGM, 364.1 MB) in extremely large-scale environments. Unlike the single-agent baselines and MA3DSG, which ran only on CPUs, the multi-agent baselines used GPUs on the backend because of their model complexity.
  • Figure 2: The overall architecture of the proposed MA3DSG. Each agent incrementally generates 3D semantic scene graphs in a large-scale environment. The framework consists of multi-agent exploration, 3D semantic scene graph generation, and graph alignment, where agents collaboratively construct and integrate local scene graphs into a unified global representation.
  • Figure 3: Unified domain evaluation. (a) Prior works treat each explored scene separately. (b) A newly annotated final 3D scene graph reflects temporal changes from randomly ordered visits for the LDCP scenario.
  • Figure 4: Qualitative results of SGFN and MA3DSG. We visualize (a) incrementally scanned point clouds, (b) ground truth instance segmentation, (c) ground truth 3D Semantic Scene Graph, (d) SGFN-generated, and (e) MA3DSG-generated 3D Semantic Scene Graphs. For the same room, the upper row shows SCP results and the lower row shows LDCP results.
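
The $4\times$ and $98\times$ factors quoted in the Figure 1 caption can be checked directly from the runtimes and traffic volumes it reports. The short calculation below uses only those four numbers; the variable names are illustrative.

```python
# Values as reported in the Figure 1 caption.
sgfn_runtime_min = 61.8       # single-agent SGFN
ma3dsg_runtime_min = 14.8     # MA3DSG
baseline_traffic_mb = 364.1   # multi-agent SGFN + SG-PGM
ma3dsg_traffic_mb = 3.7       # MA3DSG

speedup = sgfn_runtime_min / ma3dsg_runtime_min
traffic_reduction = baseline_traffic_mb / ma3dsg_traffic_mb

print(f"speedup: {speedup:.1f}x")                      # speedup: 4.2x
print(f"traffic reduction: {traffic_reduction:.1f}x")  # traffic reduction: 98.4x
```

Note that the two ratios are taken against different baselines: the speedup is relative to the single-agent system, and the traffic reduction is relative to the multi-agent system.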