Affordance Representation and Recognition for Autonomous Agents

Habtom Kahsay Gidey, Niklas Huber, Alexander Lenz, Alois Knoll

TL;DR

This work addresses how autonomous agents can build actionable world models from structured data, targeting two obstacles: DOM verbosity and brittle service integrations. It introduces two architectural patterns. The DOM Transduction Pattern distills complex webpages into a compact Page Affordance Model, and the Hypermedia Affordances Recognition Pattern discovers service capabilities at runtime via WoT Thing Descriptions. Together, they drive the construction of a Cognitive Map that unifies structured page data and dynamic service capabilities, enabling more efficient, resilient, and predictive automation on the web. The paper lays a pattern-language foundation with concrete design constraints and outlines a path toward future multimodal perception patterns.

Abstract

The autonomy of software agents is fundamentally dependent on their ability to construct an actionable internal world model from the structured data that defines their digital environment, such as the Document Object Model (DOM) of web pages and the semantic descriptions of web services. However, constructing this world model from raw structured data presents two critical challenges: the verbosity of raw HTML makes it computationally intractable for direct use by foundation models, while the static nature of hardcoded API integrations prevents agents from adapting to evolving services. This paper introduces a pattern language for world modeling from structured data, presenting two complementary architectural patterns. The DOM Transduction Pattern addresses the challenge of web page complexity by distilling a verbose, raw DOM into a compact, task-relevant representation, or world model, optimized for an agent's reasoning core. Concurrently, the Hypermedia Affordances Recognition Pattern enables the agent to dynamically enrich its world model by parsing standardized semantic descriptions to discover and integrate the capabilities of unknown web services at runtime. Together, these patterns provide a robust framework for engineering agents that can efficiently construct and maintain an accurate world model, enabling scalable, adaptive, and interoperable automation across the web and its extended resources.

Paper Structure

This paper contains 17 sections, 3 figures.

Figures (3)

  • Figure 1: Flow and components of the DOM Transduction Pattern. Raw DOM is distilled into an affordance representation, the Page Affordance Model (PAM), and then fused with other structured percepts, such as service contracts and WoT Things, to update the Cognitive Map, the agent’s world model.
  • Figure 2: The flow of the Hypermedia Affordances Recognition Pattern involves the agent discovering a semantic description and parsing it into an affordances catalog which updates the Cognitive Map.
  • Figure 3: High-level flow from structured environments (web services, devices) to a unified cognitive map. Vision is excluded; only structured inputs are considered.
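
The flows described in the three captions can be illustrated with a minimal sketch. It is not the authors' implementation: the function and class names (`transduce_dom`, `recognize_td_affordances`, `update_cognitive_map`), the set of tags treated as interactive, the record layout, and the sample inputs are all assumptions made for illustration. The only external convention relied on is that a WoT Thing Description lists its `properties`, `actions`, and `events` as maps whose entries carry a `forms` array with an `href`.

```python
import json
from html.parser import HTMLParser

# Tags treated as interactive; this set is an illustrative assumption.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class DomTransducer(HTMLParser):
    """Distils raw HTML into a flat list of affordance records (a toy PAM)."""

    def __init__(self):
        super().__init__()
        self.affordances = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return  # layout and styling markup is dropped
        attr = dict(attrs)
        self.affordances.append({
            "source": "dom",
            "kind": tag,
            "name": attr.get("id") or attr.get("name") or attr.get("href") or "",
        })


def transduce_dom(html):
    """Figure 1 flow: raw DOM in, compact affordance list out."""
    parser = DomTransducer()
    parser.feed(html)
    parser.close()
    return parser.affordances


def recognize_td_affordances(td_json):
    """Figure 2 flow: parse a Thing Description into an affordances catalog."""
    td = json.loads(td_json)
    catalog = []
    for section in ("properties", "actions", "events"):
        for name, spec in td.get(section, {}).items():
            forms = spec.get("forms", [])
            catalog.append({
                "source": td.get("title", "unknown"),
                "kind": section,
                "name": name,
                "href": forms[0].get("href") if forms else None,
            })
    return catalog


def update_cognitive_map(cognitive_map, affordances):
    """Figure 3 flow: fuse structured percepts into one keyed map."""
    for a in affordances:
        cognitive_map[(a["source"], a["kind"], a["name"])] = a
    return cognitive_map


# Hypothetical sample inputs, not taken from the paper.
SAMPLE_HTML = (
    '<div class="wrap"><form><input name="q">'
    '<button id="search">Go</button></form>'
    '<a href="/help">Help</a></div>'
)
SAMPLE_TD = json.dumps({
    "title": "Lamp",
    "properties": {"status": {"forms": [{"href": "https://lamp.example.com/status"}]}},
    "actions": {"toggle": {"forms": [{"href": "https://lamp.example.com/toggle"}]}},
})
```

Calling `update_cognitive_map` first with `transduce_dom(SAMPLE_HTML)` and then with `recognize_td_affordances(SAMPLE_TD)` on the same dictionary yields a single map holding both page-level and service-level affordances. Keying on `(source, kind, name)` is one simple choice for letting later percepts overwrite stale entries; a real agent would likely need a richer merge policy.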