FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning

Zhengyu Fu, René Zurbrügg, Kaixian Qu, Marc Pollefeys, Marco Hutter, Hermann Blum, Zuria Bauer

Abstract

Recent work in 3D scene understanding is moving beyond purely spatial analysis toward functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependence that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their marginals, yielding substantially better calibrated confidence scores. To benchmark this setting, we introduce FunThor, a synthetic dataset based on AI2-THOR with part-level geometry and rule-based functional annotations. Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations, highlighting the benefits of holistic probabilistic modeling for functional scene understanding. See our project page at https://funfact-scenegraph.github.io/

Paper Structure

This paper contains 46 sections, 3 equations, 15 figures, 8 tables, 1 algorithm.

Figures (15)

  • Figure 1: FunFact for functional scene understanding. Given posed RGB-D inputs, FunFact reconstructs an object- and part-centric 3D map and builds a functional scene graph (top). Candidate relations (e.g., remote controls TV, switch toggles lamp) are encoded as binary variables in a dual factor graph (bottom), where cardinality and proximity factors jointly resolve ambiguities via belief propagation, yielding calibrated per-edge confidence scores.
  • Figure 2: Overview of FunFact. Given posed RGB-D images, FunFact generates scene reconstructions and functional 3D scene graphs in two stages: i) Scene Reconstruction. Given a set of RGB-D images and respective poses, we extract bounding boxes, a scene description, an object list, and candidate part names using GPT-4.1. These textual cues are used to obtain open-vocabulary object detections with GroundingDINO, which are filtered for consistency and turned into region proposals and SAM-based instance masks. A second GroundingDINO + SAM branch segments functional parts conditioned on the predicted object and part names. Finally, we lift all object and part instances to 3D and fuse them across views, yielding a coherent, part-aware 3D reconstruction that forms the basis for the subsequent functional scene graph inference. ii) Functional Scene-Graph Creation. Given the semantic 3D reconstruction and the part/object nodes, GPT-4.1 proposes object–object and object–part relations to form an initial functional scene graph. We convert this graph into a dual factor graph with different priors and perform belief propagation to obtain calibrated edge probabilities. This yields the posterior functional scene graph grounded in the reconstructed scene.
  • Figure 3: Functional Scene Graph and its Dual Factor Graph. Left: Candidate functional scene graph with edges $e_1,\dots,e_4$ representing knob–burner relations. Right: The dual factor graph, where each binary variable $x_i$ is the dual of edge $e_i$, encoding whether that relation is present. $p_i$: unary proximity prior on $x_i$; $K_i$: cardinality factor enforcing one-to-one association per knob; $B_i$: cardinality factor enforcing one-to-one association per burner.
  • Figure 4: Examples of two of our newly annotated AI2-THOR environments. For two scenes (bathroom, top; kitchen, bottom) we show a top-down view with mapped functional objects (left) and the corresponding instance and part segmentations with functional edges (right), illustrating our functional annotations.
  • Figure 5: Qualitative Results. Top: Input images with detected functional objects. Bottom-left: Reconstructed object and part point clouds with predicted functional relations (red: confidence $<$ 0.5; yellow: confidence $\geq$ 0.5). Bottom-right: Final functional 3D scene graph after confidence thresholding; red edges indicate object-part hierarchy and gray edges indicate functional relations.
  • ...and 10 more figures
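The knob–burner scenario in Figure 3 can be sketched in code to illustrate how joint inference over the dual factor graph yields coupled, calibrated edge marginals. The sketch below is a minimal illustration, not the paper's implementation: the edge layout and proximity-prior values are hypothetical, and it uses exact enumeration (tractable for four binary variables) where the paper uses belief propagation.

```python
from itertools import product

# Hypothetical setup mirroring Figure 3: two knobs, two burners,
# four candidate edges e1..e4 given as (knob, burner) index pairs.
edges = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Unary proximity priors p_i on each binary variable x_i
# (illustrative values, not taken from the paper).
prox = [0.8, 0.4, 0.3, 0.7]

def cardinality_ok(assign):
    """Hard one-to-one cardinality factors (K_i and B_i in Figure 3):
    each knob and each burner participates in exactly one active edge."""
    for knob in (0, 1):
        if sum(a for a, (k, _) in zip(assign, edges) if k == knob) != 1:
            return False
    for burner in (0, 1):
        if sum(a for a, (_, b) in zip(assign, edges) if b == burner) != 1:
            return False
    return True

# Joint inference by exhaustive enumeration over all 2^4 assignments.
weights = {}
for assign in product((0, 1), repeat=len(edges)):
    if not cardinality_ok(assign):
        continue  # cardinality factor assigns zero weight
    w = 1.0
    for a, p in zip(assign, prox):
        w *= p if a else (1.0 - p)  # unary proximity prior
    weights[assign] = w

# Normalize and read off per-edge marginals P(x_i = 1).
Z = sum(weights.values())
marginals = [
    sum(w for a, w in weights.items() if a[i]) / Z
    for i in range(len(edges))
]
```

Even in this toy case the cardinality factors couple the edges: only the two valid one-to-one matchings survive, so the strong prior on (knob 1, burner 1) pushes its marginal well above its raw unary prior while suppressing the competing (knob 1, burner 2) edge, which is the ambiguity-resolution effect the dual factor graph is designed to capture.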