CRAFT-E: A Neuro-Symbolic Framework for Embodied Affordance Grounding

Zhou Chen, Joe Lin, Carson Bulgin, Sathyanarayanan N. Aakur

TL;DR

CRAFT-E presents a modular neuro-symbolic framework for embodied affordance grounding that couples a verb–property–object knowledge base with CLIP-based visual grounding and energy-based grasp reasoning. By explicitly modeling functional relationships and incorporating grasp feasibility, it offers interpretable, open-world object grounding for assistive robotics. Extensive static, real-world, and ImageNet-based evaluations show competitive performance with strong transparency and robustness to perceptual noise. The work advances trustworthy robotic decision-making by exposing grounding paths and enabling mixed-initiative human-robot interaction.

Abstract

Assistive robots operating in unstructured environments must understand not only what objects are, but what they can be used for. This requires grounding language-based action queries to objects that both afford the requested function and can be physically retrieved. Existing approaches often rely on black-box models or fixed affordance labels, limiting transparency, controllability, and reliability for human-facing applications. We introduce CRAFT-E, a modular neuro-symbolic framework that composes a structured verb-property-object knowledge graph with visual-language alignment and energy-based grasp reasoning. The system generates interpretable grounding paths that expose the factors influencing object selection and incorporates grasp feasibility as an integral part of affordance inference. We further construct a benchmark dataset with unified annotations for verb-object compatibility, segmentation, and grasp candidates, and deploy the full pipeline on a physical robot. CRAFT-E achieves competitive performance in static scenes, ImageNet-based functional retrieval, and real-world trials involving 20 verbs and 39 objects. The framework remains robust under perceptual noise and provides transparent, component-level diagnostics. By coupling symbolic reasoning with embodied perception, CRAFT-E offers an interpretable and customizable alternative to end-to-end models for affordance-grounded object selection, supporting trustworthy decision-making in assistive robotic systems.

Paper Structure

This paper contains 16 sections, 5 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the CRAFT-E framework. Given an input scene and a verb query (e.g., “write”), CRAFT-E predicts graspable regions, generates affordance-centric subgraphs via large language models and a knowledge base, and grounds the verb query to a functionally appropriate object using CLIP-based matching. The selected object can optionally be passed to a robotic actuation module for physical retrieval or delivery.
  • Figure 2: Comparison of affordance reasoning via an LLM-derived knowledge base (left) vs. ConceptNet-derived subgraphs (right). Our LLM-derived knowledge base induces edges from verbs to objects via interpretable properties to support compositional generalization and functional alignment. In contrast, ConceptNet relies on general-purpose relations such as UsedFor and RelatedTo, yielding noisy or semantically diffuse paths. Even the best-case ConceptNet path (lower left) lacks explanatory structure.
  • Figure 3: Qualitative results showing CRAFT-E's modular pipeline across three verb queries: "write", "pick up", and "construct". Columns show: (1) input scene, (2) grasp prediction, (3) region proposals (green), and (4) final grounding (blue). Top: successful prediction. Middle: grasp failure, where the correct object is excluded from reasoning due to grasp infeasibility. Bottom: grounding failure, where a graspable and segmented but functionally incorrect object is selected.
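
The pipeline described in the captions above (grasp filtering, verb-property-object paths, then visual-language matching) can be illustrated with a small toy sketch. This is not the authors' implementation. The knowledge-base entries, the `Region` type, the `grasp_threshold` parameter, and the scoring rule (similarity weighted by the fraction of verb properties the object covers) are all assumptions made for illustration, and the grasp and CLIP scores are hand-set stand-ins rather than model outputs. The sketch only shows how a grounding path and a per-stage trace could expose why an object was selected or excluded.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base (not from the paper):
# verbs and objects are linked through shared, human-readable properties.
VERB_PROPERTIES = {
    "write": {"leaves_marks", "handheld"},
    "construct": {"rigid", "stackable"},
}
OBJECT_PROPERTIES = {
    "pen": {"leaves_marks", "handheld", "rigid"},
    "marker": {"leaves_marks", "handheld"},
    "block": {"rigid", "stackable"},
    "sponge": {"handheld"},
}


@dataclass
class Region:
    label: str          # object label proposed for this region
    grasp_score: float  # stand-in for grasp feasibility, in [0, 1]
    clip_score: float   # stand-in for image-text similarity, in [0, 1]


def grounding_paths(verb, obj):
    """Return every (verb, property, object) path linking verb and object."""
    shared = VERB_PROPERTIES.get(verb, set()) & OBJECT_PROPERTIES.get(obj, set())
    return [(verb, prop, obj) for prop in sorted(shared)]


def ground(verb, regions, grasp_threshold=0.5):
    """Pick the best region for a verb and record why others were dropped."""
    best = None
    trace = []
    for region in regions:
        # Stage 1: regions that cannot be grasped are dropped before reasoning.
        if region.grasp_score < grasp_threshold:
            trace.append((region.label, "excluded: grasp infeasible"))
            continue
        # Stage 2: the object must be reachable from the verb via a property.
        paths = grounding_paths(verb, region.label)
        if not paths:
            trace.append((region.label, "excluded: no verb-property-object path"))
            continue
        # Stage 3 (assumed scoring rule): similarity weighted by property coverage.
        score = region.clip_score * len(paths) / len(VERB_PROPERTIES[verb])
        trace.append((region.label, f"candidate: score={score:.2f}"))
        if best is None or score > best[1]:
            best = (region.label, score, paths)
    return best, trace


# Usage: the pen matches "write" best but is not graspable in this scene.
regions = [
    Region("pen", grasp_score=0.2, clip_score=0.9),
    Region("marker", grasp_score=0.8, clip_score=0.7),
    Region("sponge", grasp_score=0.9, clip_score=0.6),
]
best, trace = ground("write", regions)
print(best)
for entry in trace:
    print(entry)
```

In this toy scene the pen is dropped at the grasp stage, so the marker is selected. That mirrors the kind of grasp failure shown in the middle row of Figure 3, and the trace shows which stage was responsible.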