CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou, Xiangyu Chen, Shilin Shan, Boyu Ma, Bofan Lyu, Gen Li, Jianfei Yang

Abstract

When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
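The task's input-output contract can be made concrete with a minimal sketch. All names, shapes, and the trivial argmax "model" below are illustrative assumptions, not the paper's actual API: a real system would predict the per-object query compatibility from the language intent and the geometry.

```python
import numpy as np

def ground_affordance(points: np.ndarray, instance_ids: np.ndarray,
                      query_scores: np.ndarray) -> np.ndarray:
    """Toy stand-in for intent-driven affordance grounding.

    points:        (N, 3) scene point cloud containing several objects.
    instance_ids:  (N,) object index for each point.
    query_scores:  (K,) hypothetical per-object compatibility with the
                   language intent (a real model infers this from text).
    Returns an (N,) per-point affordance mask: points of the object that
    best matches the intent get probability 1, all others 0.
    """
    target = int(np.argmax(query_scores))          # pick the best-matching object
    return (instance_ids == target).astype(float)  # per-point binary mask

# Two objects sharing the "cut" affordance (e.g., knife vs. scissors):
pts = np.random.rand(8, 3)
ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
mask = ground_affordance(pts, ids, query_scores=np.array([0.9, 0.4]))
print(mask)  # → [1. 1. 1. 1. 0. 0. 0. 0.]
```

The benchmark's harder cases are exactly those where `query_scores` for confusing pairs are close, so the decision must come from implicit intent rather than category names.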

Paper Structure

This paper contains 19 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The proposed task of Multi-Object Affordance Grounding under Intent-Driven Instructions. Given a multi-object point cloud and a natural language query describing an intended action, the goal is to predict a per-point affordance mask. The same scene can yield different targets depending on the query intent. We introduce CompassAD, a benchmark for this challenging setting.
  • Figure 2: Overview of the CompassAD benchmark. (a) Affordance concept distribution. (b) Object category distribution. (c) Hierarchy of confusing pairs grouped by target confusing affordance types. (d) Source breakdown of the collected 3D object instances. (e) Confusion matrix between affordance and object categories, highlighting the many-to-many nature of real-world affordances.
  • Figure 3: Overall architecture of CompassNet. Given 3D point clouds of a scene and a human query, Uni3D (Zhou et al., 2023) and RoBERTa (Liu et al., 2019) are applied to produce per-point features $\bm{F}$ and text features $\bm{T}$. We then propose Instance-bounded Cross Injection (ICI), which enhances both region- and point-level representations through coarse-to-fine query interactions while preventing cross-object leakage of query semantics. To enable more accurate affordance prediction, Bi-level Contrastive Refinement (BCR) is further introduced to explicitly identify the functional regions that best match the query (TG-Softmax) and provide additional supervision for highly ambiguous point-level affordances (TP-HardNeg).
  • Figure 4: Qualitative comparison on CompassAD. Each triplet shows ground truth (GT), CompassNet (Ours), and GLANCE (SOTA). Left: the same scene queried with different intents activates different objects/regions (chair seat vs. bed surface), illustrating query-dependent disambiguation. Right: diverse confusing pairs (knife vs. scissors, skateboard vs. surfboard, kettle vs. cup). Red denotes higher affordance probability.
  • Figure 5: Real-world robotic grasping in confusing multi-object scenes. Each row shows a different scenario. Real-world scene containing confusing objects and distractors. Affordance prediction from CompassNet on the reconstructed point cloud (red = high probability). Robotic grasp execution based on the predicted affordance. Top: given a cutting-related query, the model correctly identifies the knife over scissors. Bottom: given a hammering-related query, the model correctly identifies the hammer over distractors.
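The instance-bounding idea behind ICI can be illustrated with a small numpy sketch. Everything here is our own simplification for exposition: we pool the text tokens into one query vector and normalize point-query relevance separately within each object, so the injected semantics compete only inside an object's boundary rather than leaking across objects. The paper's actual module operates at both region and point levels with coarse-to-fine interactions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instance_bounded_injection(F, T, instance_ids):
    """Inject query semantics into point features, bounded per instance.

    F:            (N, d) per-point features.
    T:            (L, d) text-token features.
    instance_ids: (N,) object index for each point.
    Relevance weights are softmax-normalized *within* each object, so the
    query's response to one object cannot suppress or leak into another.
    """
    t = T.mean(axis=0)                  # pooled query feature, (d,)
    r = F @ t / np.sqrt(F.shape[1])     # (N,) point-query relevance
    out = F.copy()
    for k in np.unique(instance_ids):
        m = instance_ids == k
        w = softmax(r[m])               # relevance competes only inside object k
        out[m] = F[m] + w[:, None] * t  # residual, instance-bounded injection
    return out
```

With a global softmax over all N points instead, a strongly matching object would absorb nearly all the attention mass and starve the confusable competitor of query signal; per-instance normalization keeps both objects' text-conditioned responses comparable, which is what the subsequent contrastive refinement then discriminates.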