SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes

Léopold Maillard, Francis Engelmann, Tom Durand, Boxiao Pan, Yang You, Or Litany, Leonidas Guibas, Maks Ovsjanikov

Abstract

Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.
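To make the verification loop described above concrete, here is a minimal sketch in Python, assuming the activity has already been decomposed into atomic actions and that a black-box geometric check is available. All names here (`AgentProfile`, `REQUIRES`, `verify_activity`) are hypothetical illustrations, not the released SceneTeract API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentProfile:
    reach: float             # maximum reach distance, in meters
    occupancy_radius: float  # navigation footprint radius, in meters

@dataclass
class AtomicAction:
    kind: str    # e.g. "navigate_to", "open", "sit_on"
    target: str  # object identifier in the scene

# Illustrative action-to-property mapping: which accessibility
# requirements each atomic action must satisfy.
REQUIRES = {
    "navigate_to": ("navigability",),
    "open":        ("reachability", "clearance"),
    "sit_on":      ("reachability", "clearance"),
}

def verify_activity(actions: list[AtomicAction],
                    check: Callable[[AtomicAction, str, AgentProfile], bool],
                    agent: AgentProfile):
    """Validate each atomic action; return step-level pass/fail diagnostics."""
    report = []
    for action in actions:
        # Run every geometric check this action requires, conditioned
        # on the embodied agent profile.
        results = {prop: check(action, prop, agent)
                   for prop in REQUIRES.get(action.kind, ())}
        report.append((action, results))
        if not all(results.values()):  # activity fails at the first infeasible step
            break
    return report
```

Under this sketch, the returned report mirrors the step-level pass/fail diagnostics described in the paper: the first action whose required properties fail marks the activity as infeasible for that agent profile.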

Paper Structure

This paper contains 96 sections, 7 equations, 13 figures, 7 tables, 7 algorithms.

Figures (13)

  • Figure 1: We introduce SceneTeract, a verification engine that, given a 3D scene, an embodied agent profile, and a target activity, decomposes the task into atomic actions and validates each step with explicit geometric and physical checks, producing fine-grained feasibility diagnostics. We deploy this pipeline for three applications: auditing synthetic 3D scenes, benchmarking frontier VLMs on embodiment-aware functional reasoning, and providing reward signals for post-training a VLM in order to improve its reasoning capabilities.
  • Figure 2: Overview of the SceneTeract verification framework. Given an input 3D scene $(\mathcal{S})$, embodied agent profile $(\mathcal{A})$, and target activity $(\mathcal{T})$, a VLM planner $(\Phi)$ decomposes the activity into atomic actions, which are then checked by a 3D verifier $(\Psi)$ for geometric feasibility. The pipeline returns a fine-grained affordance report $(\mathcal{R})$ with step-level pass/fail diagnostics. This enables multiple downstream applications (Section \ref{sec:method-application}), including scene auditing, VLM benchmarking, and training-time reward supervision.
  • Figure 3: Overall success rates for tasks $\mathcal{T}$ and boolean properties $\mathcal{P}$, per agent profile, on the 3D-FRONT \cite{3dfront} dataset.
  • Figure 4: Functional patterns and failure modes detected by SceneTeract. (a) navigation maps highlighting restricted areas (yellow) disconnected from the main area (green) for agents with larger occupancy radii (impacting the red border; see the navigability sketch after this list); (b) a blocked hallway prevents access to the target cabinet, even though its local interaction clearance (green box) is valid; (c) insufficient space (red box) to articulate the object; (d) although the object is partially reachable, the identified interaction area (orange) to sit on the couch is obstructed; (e) the identified interactive space (pink) to open the cabinet is within the agent's reach from a valid interaction zone (blue).
  • Figure 5: Main components of the SceneTeract framework implementation. The system is organized around four interacting modules: a Scene Manager (which maintains structured scene state and object properties), a Scene Renderer (which produces visual observations for prompting), a VLM Client (which instantiates the Semantic Planner), and a Geometric Verifier (which executes grounded feasibility checks). The modules communicate via the shared input triplet $\mathcal{I}=\langle \mathcal{S},\mathcal{T},\mathcal{A}\rangle$ and the action-to-property mapping $\mathrm{Req}(a)$ that links each atomic action to the relevant geometric checks.
  • ...and 8 more figures
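The navigability pattern in Figure 4(a), where parts of the floor become disconnected for agents with larger occupancy radii, can be sketched as a simple occupancy-grid computation. The code below is an assumed illustration (not the authors' verifier): it erodes the free-space mask by the agent's footprint, then labels connected components to expose restricted areas.

```python
import numpy as np
from scipy import ndimage

def navigable_mask(free: np.ndarray, radius_cells: int) -> np.ndarray:
    """Cells the agent's disk-shaped footprint can occupy without collision."""
    y, x = np.ogrid[-radius_cells:radius_cells + 1, -radius_cells:radius_cells + 1]
    disk = x**2 + y**2 <= radius_cells**2
    # A cell is navigable only if every cell under the footprint is free,
    # i.e. the erosion of the free-space mask by the footprint.
    return ndimage.binary_erosion(free, structure=disk)

def restricted_area(free: np.ndarray, start: tuple, radius_cells: int) -> np.ndarray:
    """Navigable cells that are disconnected from the agent's start position."""
    nav = navigable_mask(free, radius_cells)
    labels, _ = ndimage.label(nav)  # connected components of navigable space
    main = (labels == labels[start]) if nav[start] else np.zeros_like(nav)
    return nav & ~main  # navigable, but unreachable from `start`
```

Growing `radius_cells` shrinks the navigable mask and can split it into components, reproducing the contrast between the yellow restricted regions and the green main area described in the Figure 4 caption.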