The Bare Necessities: Designing Simple, Effective Open-Vocabulary Scene Graphs

Christina Kassab; Matías Mattamala; Sacha Morin; Martin Büchner; Abhinav Valada; Liam Paull; Maurice Fallon

TL;DR

The paper analyzes 3D open-vocabulary scene graphs to identify practical bottlenecks for real-time embodied agents. Three focused studies on image pre-processing, multi-view feature fusion, and feature selection show that costly pre-processing provides little benefit, that naive averaging across views degrades performance, and that entropy-based per-view selection improves performance at no extra cost. These insights are integrated into a minimal, computationally balanced pipeline that matches state-of-the-art classification accuracy with roughly a threefold reduction in compute. The work offers concrete guidance for designing real-time open-vocabulary scene graphs and demonstrates that simpler architectures can perform strongly when paired with careful feature selection and efficient mapping.

Abstract

3D open-vocabulary scene graph methods are a promising map representation for embodied agents; however, many current approaches are computationally expensive. In this paper, we reexamine the critical design choices established in previous works to optimize both efficiency and performance. We propose a general scene graph framework and conduct three studies that focus on image pre-processing, feature fusion, and feature selection. Our findings reveal that commonly used image pre-processing techniques provide minimal performance improvement while tripling computation (on a per-object-view basis). We also show that averaging feature labels across different views significantly degrades performance. We study alternative feature selection strategies that enhance performance without adding unnecessary computational costs. Based on our findings, we introduce a computationally balanced approach for 3D point cloud segmentation with per-object features. The approach matches state-of-the-art classification accuracy while achieving a threefold reduction in computation.
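
To make the contrast between view averaging and per-view selection concrete, the following is a minimal sketch of one plausible reading of the two strategies. It is not the authors' implementation. It assumes each object has one image embedding per view and each candidate label has one text embedding, both in a shared CLIP-style space. The function names, the softmax temperature, and the use of cosine similarity are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def label_probs(view_feats, text_feats, temperature=0.01):
    """Softmax over labels for each view, from cosine similarities.

    view_feats: (n_views, d) image embeddings of one object (assumed input).
    text_feats: (n_labels, d) text embeddings of candidate labels (assumed input).
    temperature: hypothetical scaling value, not taken from the paper.
    """
    v = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def entropy(p, eps=1e-12):
    """Shannon entropy of each row of a probability matrix."""
    return -(p * np.log(p + eps)).sum(axis=1)


def classify_by_average(view_feats, text_feats):
    """Baseline: average the per-view features, then pick the best label."""
    mean_feat = view_feats.mean(axis=0, keepdims=True)
    return int(label_probs(mean_feat, text_feats)[0].argmax())


def classify_by_min_entropy(view_feats, text_feats):
    """Select the view whose label distribution is least uncertain.

    Returns (label_index, selected_view_index).
    """
    p = label_probs(view_feats, text_feats)
    best_view = int(entropy(p).argmin())
    return int(p[best_view].argmax()), best_view
```

In this sketch, selection reuses the per-view similarities that classification already requires, so it adds only an entropy computation per view. On a toy input where one sharp view disagrees with several blurry ones, the two functions can return different labels, which is the kind of effect the fusion and selection studies examine.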
