The Bare Necessities: Designing Simple, Effective Open-Vocabulary Scene Graphs

Christina Kassab, Matías Mattamala, Sacha Morin, Martin Büchner, Abhinav Valada, Liam Paull, Maurice Fallon

TL;DR

The paper analyzes 3D open-vocabulary scene graphs to identify practical bottlenecks for real-time embodied agents. Three focused studies on image pre-processing, multi-view feature fusion, and feature selection show that costly pre-processing yields minimal improvement while tripling per-object-view computation, that averaging features across views degrades performance, and that entropy-based per-view selection improves performance without extra cost. These insights are integrated into a minimal, computationally balanced pipeline that matches state-of-the-art classification accuracy at roughly a threefold reduction in compute. The work offers concrete guidance for designing real-time open-vocabulary scene graphs and demonstrates that simpler architectures can perform strongly when paired with careful feature selection and efficient mapping.

Abstract

3D open-vocabulary scene graph methods are a promising map representation for embodied agents; however, many current approaches are computationally expensive. In this paper, we reexamine the critical design choices established in previous works to optimize both efficiency and performance. We propose a general scene graph framework and conduct three studies that focus on image pre-processing, feature fusion, and feature selection. Our findings reveal that commonly used image pre-processing techniques provide minimal performance improvement while tripling computation (on a per-object-view basis). We also show that averaging feature labels across different views significantly degrades performance. We study alternative feature selection strategies that enhance performance without adding unnecessary computational costs. Based on our findings, we introduce a computationally balanced approach for 3D point cloud segmentation with per-object features. The approach matches state-of-the-art classification accuracy while achieving a threefold reduction in computation.

Paper Structure

This paper contains 24 sections, 6 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: 3D open-vocabulary segmentations and processing times from three different methods: ConceptGraphs (left), HOVSG (middle), and our proposed system (right). In this work, we analyze open-vocabulary scene graph methods and use our findings to develop a minimal system that achieves comparable object classification accuracy at a fraction of the computational cost of existing methods. We show object labels grouped into broader categories for easy visualization; for more details, see Supp. Sec. \ref{sec:final_results}.
  • Figure 2: Typical framework used by open-vocabulary scene graph methods. The pipeline includes: 1) Input RGB-D images and poses. 2) Perform 3D instance segmentation via 2D segmentation or point cloud accumulation. 3) Select unobstructed views based on visible points. 4) Scale and fuse features or apply a SAM mask. 5) Fuse features across views for final object labeling. Highlighted sections are explored in Section \ref{sec:study}.
  • Figure 3: Classification accuracy using different pre-processed images, including crops and masks generated by SAM2. $\ovoid$ corresponds to ScanNet++ scenes and $\bigtriangleup$ to Replica scenes. $\medbullet$ indicates the mean. We show crops at different scale factors, crops fused over multiple scales, SAM masks with various backgrounds, and SAM masks fused with crops. The letters denote either the scale or the SAM mask: a = 1.0, b = 1.2, c = 1.5, d = 1.8, e = 2.0, sw = SAM mask with a white background, sb = SAM mask with a black background, and st = SAM mask with a transparent background. Combinations of letters indicate fusions. The results indicate that the various pre-processing methods do not improve performance significantly.
  • Figure 4: Example of the multi-view variance of CLIP outputs. We show various views of a lamp from ScanNet++ scene 0a7cc12c0e, with crops scaled to 1.5x the original bounding box. The number of visible points is similar in each view, but the output varies greatly, indicating that visibility is not the optimal criterion for choosing the "best" views.
  • Figure 5: Object classification accuracy using different multi-view feature fusion and selection strategies. $\bar{x}$ refers to averaging the multi-view features, $\bar{x}_{\mathrm{H}}$ is a weighted average where the weights are the entropy values, $\bar{x}_{\mathrm{S}}$ is a weighted average based on the score, $x_{\mathrm{min(H)}}$ selects the feature with the lowest entropy, and $x_{\mathrm{max(S)}}$ selects the feature with the highest score.
  • ...and 9 more figures
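
The five fusion and selection strategies named in the Figure 5 caption can be illustrated with a short NumPy sketch. This is not the authors' implementation. It assumes that each view yields one image feature vector, that class probabilities come from a softmax over cosine similarities to text embeddings (with a hypothetical temperature of 100), that the "score" is the maximum class probability, and that the entropy-weighted average down-weights high-entropy views through a softmax over negative entropy. The caption does not specify any of these details, and the function names `view_statistics`, `fuse`, and `classify` are invented for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def view_statistics(view_feats, text_feats, temperature=100.0):
    """Per-view class probabilities, entropy H, and score S.

    view_feats: (num_views, dim) image features, one row per view.
    text_feats: (num_classes, dim) text embeddings, one row per class.
    Assumption: S is the maximum class probability of a view.
    """
    sims = l2_normalize(view_feats) @ l2_normalize(text_feats).T
    probs = softmax(temperature * sims, axis=1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    score = probs.max(axis=1)
    return probs, entropy, score


def fuse(view_feats, text_feats, strategy="min_entropy"):
    """Reduce multi-view features to a single unit-length object feature."""
    _, entropy, score = view_statistics(view_feats, text_feats)
    if strategy == "mean":                # plain average over views
        fused = view_feats.mean(axis=0)
    elif strategy == "entropy_weighted":  # assumed weighting: softmax(-H)
        w = softmax(-entropy, axis=0)
        fused = (w[:, None] * view_feats).sum(axis=0)
    elif strategy == "score_weighted":    # weights proportional to S
        w = score / score.sum()
        fused = (w[:, None] * view_feats).sum(axis=0)
    elif strategy == "min_entropy":       # keep the most confident view
        fused = view_feats[np.argmin(entropy)]
    elif strategy == "max_score":         # keep the highest-scoring view
        fused = view_feats[np.argmax(score)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return l2_normalize(fused)


def classify(feature, text_feats):
    """Index of the class whose text embedding is most similar."""
    return int(np.argmax(l2_normalize(text_feats) @ l2_normalize(feature)))


if __name__ == "__main__":
    # Toy data: 3 classes with orthogonal text embeddings, 3 views of one object.
    text = np.eye(3)
    views = np.array([[1.0, 0.0, 0.0],   # unambiguous view
                      [1.0, 1.0, 1.0],   # fully ambiguous view
                      [0.2, 1.0, 0.9]])  # view leaning towards another class
    for name in ("mean", "entropy_weighted", "score_weighted",
                 "min_entropy", "max_score"):
        print(name, classify(fuse(views, text, name), text))
```

The two selection strategies discard all but one view, so a single ambiguous or misleading crop cannot pull the object feature away from a confident one, whereas the averaging variants mix every view into the result.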