Proposition of Affordance-Driven Environment Recognition Framework Using Symbol Networks in Large Language Models

Kazuma Arii, Satoshi Kurihara

TL;DR

The paper tackles the challenge of enabling robots to reason about affordances in dynamic scenes by using large language models as sources of commonsense knowledge. It introduces a three-stage pipeline: generate text with an LLM, reconstruct it into a symbol network using morphological and dependency parsing, and derive affordances from network distances defined by $\mathrm{distance}(s,e)=\mathrm{decay}^{n}$, where $0<\mathrm{decay}<1$ and $n$ is the composition depth. The affordance $\mathrm{affordance}(x,a)$ is then the shortest-path distance, or a penalty if the target is unreachable. The method produces context-dependent affordances, including automatic tool selection (e.g., a knife for slicing and a pencil for drawing) and environment-sensitive actions, demonstrated on an apple-centric example and evaluated against human judgments. This work contributes an interpretable bridge between symbolized LLM knowledge and robot situational understanding, offering a scalable approach to environment-aware action planning and decision making in embodied systems.

Abstract

In the quest to enable robots to coexist with humans, understanding dynamic situations and selecting appropriate actions based on common sense and affordances are essential. Conventional AI systems face challenges in applying affordance, as it represents implicit knowledge derived from common sense. However, large language models (LLMs) offer new opportunities due to their ability to process extensive human knowledge. This study proposes a method for automatic affordance acquisition by leveraging LLM outputs. The process involves generating text using LLMs, reconstructing the output into a symbol network using morphological and dependency analysis, and calculating affordances based on network distances. Experiments using "apple" as an example demonstrated the method's ability to extract context-dependent affordances with high explainability. The results suggest that the proposed symbol network, reconstructed from LLM outputs, enables robots to interpret affordances effectively, bridging the gap between symbolized data and human-like situational understanding.
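
The distance-based step can be illustrated with a minimal sketch. It assumes each edge carries a composition depth $n$ and a weight of $decay^{n}$, and it takes the affordance of an action to be the shortest-path distance from the observed object, with a fixed penalty when no path exists. The toy graph, the node names, the decay value of 0.5, the penalty of 100, and the function names are all hypothetical choices for illustration, not taken from the paper. Dijkstra's algorithm is used here as one standard way to compute shortest paths; the paper's own search procedure may differ.

```python
import heapq

DECAY = 0.5      # assumed value; the source only requires 0 < decay < 1
PENALTY = 100.0  # assumed penalty returned when the action is unreachable


def edge_distance(depth, decay=DECAY):
    """Edge weight decay ** n, where n is the composition depth of the edge."""
    return decay ** depth


def affordance(graph, start, action, penalty=PENALTY):
    """Shortest-path distance from start to action, or penalty if unreachable."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == action:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbor, depth in graph.get(node, []):
            candidate = dist + edge_distance(depth)
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return penalty


# Hypothetical toy network: node -> [(neighbor, composition depth n)]
toy_graph = {
    "apple": [("eat", 0), ("with knife", 1)],
    "with knife": [("slice", 1)],
}

if __name__ == "__main__":
    for act in ("eat", "slice", "draw"):
        print(act, affordance(toy_graph, "apple", act))
```

In this toy network, "slice" is reached only through the "with knife" node, which mirrors how a tool can be selected as a by-product of the path search, while "draw" has no path from "apple" and so receives the penalty.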

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: How the nodes of the knowledge graph are composed from sentences. The boxed nodes are Origin Nodes (formal nodes used to compose other nodes).
  • Figure 2: An example of edge composition for network construction from a sentence (whose dependencies are analyzed by CoreNLP). If a noun has any modifier, the original noun and the modified noun are constructed as separate nodes. The nodes that make up the object are connected to its Action Node, and the nodes that make up the modifier part are connected to its Attribute Nodes.
  • Figure 3: A part of the explored network when "apple" and "pencil" are observed. The distances are omitted. As shown, the actions that use the pencil as a tool are reached via the "with pencil" node. This can be interpreted as automatic selection of a tool during the search process.