Bayesian Fields: Task-driven Open-Set Semantic Gaussian Splatting

Dominic Maggio, Luca Carlone

TL;DR

Bayesian Fields tackles open-set semantic mapping by making task-driven granularity explicit and by fusing semantic observations across views within a Bayesian framework. It grounds CLIP-based relevance in a probabilistic model, uses a measurement model to account for view-dependent noise, and clusters 3D Gaussians into task-relevant objects via the Information Bottleneck, all while using a memory-efficient Gaussian Splat representation. The approach yields accurate, task-focused object extraction with substantial memory savings and fast runtimes, without requiring task-specific training. This provides a practical, scalable pipeline for open-set semantic mapping in complex scenes observed from multiple viewpoints.

Abstract

Open-set semantic mapping requires (i) determining the correct granularity to represent the scene (e.g., how should objects be defined), and (ii) fusing semantic knowledge across multiple 2D observations into an overall 3D reconstruction, ideally with a high-fidelity yet low-memory footprint. While most related works bypass the first issue by grouping together primitives with similar semantics (according to some manually tuned threshold), we recognize that the object granularity is task-dependent, and develop a task-driven semantic mapping approach. To address the second issue, current practice is to average visual embedding vectors over multiple views. Instead, we show the benefits of using a probabilistic approach based on the properties of the underlying visual-language foundation model, and leveraging Bayesian updating to aggregate multiple observations of the scene. The result is Bayesian Fields, a task-driven and probabilistic approach for open-set semantic mapping. To enable high-fidelity objects and a dense scene representation, Bayesian Fields uses 3D Gaussians which we cluster into task-relevant objects, allowing for both easy 3D object extraction and reduced memory usage. We release Bayesian Fields open-source at https://github.com/MIT-SPARK/Bayesian-Fields.
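To make the contrast between averaging and Bayesian updating concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary "relevant to the task" state, a uniform prior, conditionally independent views, and per-view probabilities that can be treated as normalized likelihoods. The function names `fuse_bayes` and `fuse_mean` and the example values are hypothetical, and the paper's actual measurement model is not reproduced here.

```python
import math

def fuse_bayes(view_probs, prior=0.5, eps=1e-6):
    """Fuse per-view relevance probabilities by summing log-odds.

    Assumption (not from the paper): each view's p / (1 - p) acts as an
    independent likelihood ratio for the binary state "relevant".
    """
    log_odds = math.log(prior / (1.0 - prior))
    for p in view_probs:
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))

def fuse_mean(view_probs):
    """Baseline: plain averaging of the per-view values."""
    return sum(view_probs) / len(view_probs)

# Hypothetical per-view probabilities for one 3D primitive and one task.
view_probs = [0.7, 0.8, 0.6, 0.75]
bayes_result = fuse_bayes(view_probs)
mean_result = fuse_mean(view_probs)
print(f"bayes={bayes_result:.4f} mean={mean_result:.4f}")
```

Under these assumptions, several moderately confident views reinforce each other in the Bayesian update, whereas averaging can never exceed the most confident single view. A view-dependent measurement model would replace the simple likelihood ratio used here.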

Paper Structure

This paper contains 19 sections, 6 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Bayesian Fields takes a collection of posed RGB images and a task list and extracts objects at the right granularity. Given a set of tasks in natural language such as "move hats and gloves" (B2), Bayesian Fields considers a pile of clothing collectively as one object in its map. Providing more specific tasks such as "find beanies" and "get buckle" (A2, A4) results in a more fine-grained scene representation at the correct granularity to support the tasks. We also detect different instances of objects relevant for the same task. For example, there are two separate clusters of cleaning products corresponding to "get cleaning products" (A5), since these objects are spatially separated in the scene. Objects shown are the 3D extracted objects that Bayesian Fields estimates are the most relevant for each task.
  • Figure 2: Overview of Bayesian Fields. Our system takes in a trained 3D Gaussian Splat, RGB images with poses, and a list of tasks. We compute cosine scores between CLIP embeddings of masked image segments and of the task list, convert them to probabilities, and aggregate these across multiple observations using our measurement model and Bayesian updating. Gaussians are grouped into 3D primitives and then clustered into task-relevant objects. Point cloud filtering (kNN) is applied as a final step to remove potential floaters.
  • Figure 3: Gaussian behavior of cosine scores for CLIP image and text embeddings.
  • Figure 4: Demonstration of three types of error in cosine score between text and multiple masked frames of the object: an Expo marker for (a) and a tape measure for (b) and (c). The object is masked with SAM2 [Ravi24arxiv-sam2] for accurate tracking across frames.
  • Figure 5: Clio Apartment dataset results of final task-relevant objects. Left: RGB rendering. Right: semantic rendering with selected extracted 3D objects.
  • ...and 8 more figures
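
The Figure 2 caption mentions a final kNN point cloud filtering step that removes potential floaters. The sketch below shows one common way such a filter can work, assuming a point is kept only if the mean distance to its k nearest neighbors stays under a threshold. The function name `knn_filter`, the parameter values, and the brute-force search are illustrative assumptions; the caption does not specify the criterion the authors use.

```python
import math

def knn_filter(points, k=3, max_mean_dist=0.5):
    """Keep points whose mean distance to their k nearest neighbors is small.

    Brute-force O(n^2) search, intended only for illustration.
    The threshold rule is an assumption, not taken from the paper.
    """
    kept = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        if nearest and sum(nearest) / len(nearest) <= max_mean_dist:
            kept.append(p)
    return kept

# Hypothetical object: a tight cluster of Gaussian centers plus one floater.
cluster = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.0)]
floater = (5.0, 5.0, 5.0)
filtered = knn_filter(cluster + [floater], k=3, max_mean_dist=0.5)
print(len(filtered), "points kept")
```

For scenes with many Gaussians, a spatial index such as a k-d tree would be the usual replacement for the brute-force neighbor search.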