UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Jiaying Lin, Dan Xu

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select the correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.

Paper Structure (21 sections, 9 equations, 9 figures, 6 tables)

This paper contains 21 sections, 9 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Overview of UniFunc3D compared to existing fragmented pipelines. (Top) Prior methods like Fun3DU rely on a visually blind text-only LLM for initial task parsing. Coupled with single-scale, passive, heuristic frame selection, this fragmented approach suffers from three critical failure modes: semantic misinterpretations (Task A $\rightarrow$ Task B), spatial-temporal context inconsistencies (a, b), and imperceptibly small targets (c), leading to errors in the final output. (Bottom) Our proposed UniFunc3D addresses these limitations by utilizing a unified Multimodal Large Language Model (MLLM) as an active observer. By employing a coarse-to-fine active spatial-temporal grounding strategy alongside visual mask verification, UniFunc3D consolidates semantic, temporal, and spatial reasoning into a single forward pass. This allows the model to accurately ground implicit targets and generate precise fine-grained 3D functional masks while preserving necessary global context.
  • Figure 2: Method overview. UniFunc3D employs a unified MLLM that combines active spatial-temporal grounding with joint functional object identification: the coarse stage (Round 1) actively surveys low-resolution video frames across multiple sampling iterations and selects the most informative candidate via visual verification; the fine stage (Round 2) processes a dense temporal window at native high resolution, delivering zoom-in capability while preserving global scene context for precise localization. Visual mask generation and verification uses SAM3 for segmentation with MLLM-based mask verification, followed by multi-view 3D lifting to obtain the final 3D masks.
  • Figure 3: Qualitative comparison. We show results for five representative queries (columns) across four methods (rows). GT denotes ground truth.
  • Figure 4: Visual comparison. Input prompt: open the right door of the wooden display cabinet to the left of the paintings.
  • Figure 5: Visual comparisons for ablation study. From left to right, top to bottom: Base, w/o Multi-view, w/o Temporal, w/o Verification, Ours, and GT. Two different viewpoints are shown for the same test example. Input prompt: Open the top left drawer of the cabinet with the TV on top.
  • ...and 4 more figures
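
The coarse-to-fine frame selection described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the shifted uniform sampling scheme, and the `score` callback (a hypothetical stand-in for the MLLM's visual verification of a candidate frame) are all assumptions made for illustration only.

```python
from typing import Callable, List


def coarse_candidates(num_frames: int, samples_per_round: int, rounds: int) -> List[List[int]]:
    """Coarse stage (assumed scheme): sample frame indices uniformly,
    shifting the starting offset on each round so rounds see different frames."""
    stride = max(num_frames // samples_per_round, 1)
    sampled = []
    for r in range(rounds):
        offset = (r * stride) // rounds
        idxs = list(range(offset, num_frames, stride))[:samples_per_round]
        sampled.append(idxs)
    return sampled


def select_frame(candidates: List[List[int]], score: Callable[[int], float]) -> int:
    """Pick the highest-scoring candidate across all rounds.
    `score` is a hypothetical placeholder for MLLM-based visual verification."""
    return max((i for rnd in candidates for i in rnd), key=score)


def fine_window(center: int, num_frames: int, radius: int) -> range:
    """Fine stage (assumed scheme): a dense window of consecutive frames
    around the selected frame, clipped to the valid index range."""
    start = max(center - radius, 0)
    stop = min(center + radius + 1, num_frames)
    return range(start, stop)
```

As a usage example, `coarse_candidates(100, 5, 2)` yields two rounds of five sparse indices each; passing them to `select_frame` with any scoring function returns a single index, and `fine_window` expands that index into the dense set of frames that would be processed at native resolution.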