Proposition of Affordance-Driven Environment Recognition Framework Using Symbol Networks in Large Language Models
Kazuma Arii, Satoshi Kurihara
TL;DR
The paper tackles the challenge of enabling robots to reason about affordances in dynamic scenes by using large language models (LLMs) as sources of commonsense knowledge. It introduces a three-stage pipeline: generate text with an LLM, reconstruct the text into a symbol network using morphological and dependency parsing, and derive affordances from network distances. Distances are defined as $\mathrm{distance}(s,e)=\mathrm{decay}^{n}$ with $0<\mathrm{decay}<1$ and $n$ the composition depth, and $\mathrm{affordance}(x,a)$ is the shortest-path distance in the network, or a penalty if no path exists. The method yields context-dependent affordances, including automatic tool selection (e.g., using a knife for slicing and a pencil for drawing) and environment-sensitive actions, demonstrated on an apple-centric example and evaluated against human judgments. The work contributes an interpretable bridge between symbolized LLM knowledge and robot situational understanding, offering a scalable approach to environment-aware action planning and decision making in embodied systems.
Abstract
For robots to coexist with humans, they must understand dynamic situations and select appropriate actions based on common sense and affordances. Conventional AI systems struggle to apply affordances, because affordances are implicit knowledge derived from common sense. Large language models (LLMs), however, offer new opportunities because they can process extensive human knowledge. This study proposes a method for automatic affordance acquisition that leverages LLM outputs. The process involves generating text with an LLM, reconstructing the output into a symbol network using morphological and dependency analysis, and calculating affordances from network distances. Experiments using "apple" as an example demonstrated the method's ability to extract context-dependent affordances with high explainability. The results suggest that the proposed symbol network, reconstructed from LLM outputs, enables robots to interpret affordances effectively, bridging the gap between symbolized data and human-like situational understanding.
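The final stage, calculating affordances from network distances, can be illustrated with a minimal sketch. It assumes the edge weight $decay^{n}$ stated in the TL;DR and treats an affordance score as the weighted shortest-path cost between two symbols, with a fixed penalty when no path exists. Summing edge weights along a path, the toy network, the node names, the function names, and the values of `DECAY` and `penalty` are all assumptions made for illustration; they are not taken from the paper.

```python
import heapq


def edge_distance(decay: float, depth: int) -> float:
    """Edge weight decay ** depth, with 0 < decay < 1 and depth the composition depth."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    return decay ** depth


def affordance(graph, source, action, penalty=float("inf")):
    """Return the shortest-path cost from source to action, or penalty if unreachable.

    graph maps each node to a dict of {neighbor: edge weight}.
    Path cost is the sum of edge weights (an assumption of this sketch).
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == action:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return penalty


# Hypothetical symbol network around "apple"; depths are chosen arbitrarily.
DECAY = 0.5
toy_network = {
    "apple": {"eat": edge_distance(DECAY, 1), "knife": edge_distance(DECAY, 2)},
    "knife": {"slice": edge_distance(DECAY, 1)},
}

if __name__ == "__main__":
    for act in ("eat", "slice", "draw"):
        print(act, affordance(toy_network, "apple", act, penalty=10.0))
```

In this toy network, "slice" is reachable from "apple" only through "knife", which mirrors the tool-selection behavior described in the TL;DR, while "draw" has no path and receives the penalty.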
