Building the Web for Agents: A Declarative Framework for Agent-Web Interaction

Sven Schultze, Meike Verena Kietzmann, Nils-Lucas Schönfeld, Ruth Stock-Homburg

TL;DR

Current AI agents on the web face brittle, unsafe, and privacy-sensitive interactions due to reliance on human-oriented UIs. VOIX introduces a web-native, declarative framework using <tool> and <context> to expose machine-readable capabilities, distributed across a Website, Browser Agent, and Inference Provider to preserve user privacy. A Chrome-based reference implementation and a three-day hackathon with 16 developers demonstrate VOIX’s learnability and expressive power, including high-level multimodal interactions and dynamic scoping. Latency benchmarks indicate VOIX offers faster, more reliable interactions than inference-based, vision-driven approaches, supporting a practical, decentralized Agentic Web with strong safety and controllability for developers and users.

Abstract

The increasing deployment of autonomous AI agents on the web is hampered by a fundamental misalignment: agents must infer affordances from human-oriented user interfaces, leading to brittle, inefficient, and insecure interactions. To address this, we introduce VOIX, a web-native framework that enables websites to expose reliable, auditable, and privacy-preserving capabilities for AI agents through simple, declarative HTML elements. VOIX introduces <tool> and <context> tags, allowing developers to explicitly define available actions and relevant state, thereby creating a clear, machine-readable contract for agent behavior. This approach shifts control to the website developer while preserving user privacy by disconnecting the conversational interactions from the website. We evaluated the framework's practicality, learnability, and expressiveness in a three-day hackathon study with 16 developers. The results demonstrate that participants, regardless of prior experience, were able to rapidly build diverse and functional agent-enabled web applications. Ultimately, this work provides a foundational mechanism for realizing the Agentic Web, enabling a future of seamless and secure human-AI collaboration on the web.

Paper Structure

This paper contains 28 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Integration of VOIX into a task management web application. (a) The human-facing interface allows users to add, complete, and clear tasks. (b) The application embeds VOIX-HTML elements in its markup, declaratively exposing state and invokable actions to match or complement these features. (c) The VOIX reference Chrome extension automatically discovers the declarations and surfaces them in a side panel, where an LLM-powered agent can reason over available tools and execute actions based on voice or chat user input.
  • Figure 2: The graphic design application demonstrates synergistic multimodal interaction using VOIX: dynamic context elements contain information about the objects on the canvas and their state, allowing the agent to determine which objects to change, and how, in order to carry out the user's instructions. A broad set of tools then enables the LLM to create, edit, rearrange, and delete objects.
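
The contract described in Figure 1, where a page declares its state and invokable actions and an agent discovers and calls them, can be illustrated with a small sketch. The TypeScript below models that idea as plain data so it runs outside a browser. It is not the VOIX API: `ToolDeclaration`, `ContextDeclaration`, `TaskPage`, `describeForAgent`, `readContext`, and `invoke` are hypothetical names chosen for illustration, and the real framework expresses these declarations as `<tool>` and `<context>` HTML elements rather than objects.

```typescript
// Illustrative sketch only: every name here is an assumption, not part of VOIX.
type ParamType = "string" | "number" | "boolean";

interface ToolDeclaration {
  name: string;
  description: string;
  params: Record<string, ParamType>;
  handler: (args: Record<string, unknown>) => void;
}

interface ContextDeclaration {
  name: string;
  read: () => string; // current page state, rendered as text for the agent
}

interface Task { id: number; title: string; done: boolean; }

// Stand-in for the task management page: it owns the state and decides
// which actions are exposed to an agent.
class TaskPage {
  private tasks: Task[] = [];
  private nextId = 1;

  readonly tools: ToolDeclaration[] = [
    {
      name: "add_task",
      description: "Add a task to the list",
      params: { title: "string" },
      handler: (args) => {
        this.tasks.push({ id: this.nextId++, title: String(args.title), done: false });
      },
    },
    {
      name: "complete_task",
      description: "Mark a task as done",
      params: { id: "number" },
      handler: (args) => {
        const task = this.tasks.find((t) => t.id === Number(args.id));
        if (task) task.done = true;
      },
    },
    {
      name: "clear_completed",
      description: "Remove all completed tasks",
      params: {},
      handler: () => {
        this.tasks = this.tasks.filter((t) => !t.done);
      },
    },
  ];

  readonly contexts: ContextDeclaration[] = [
    {
      name: "tasks",
      read: () =>
        this.tasks.map((t) => `${t.id}: ${t.title} [${t.done ? "done" : "open"}]`).join("\n"),
    },
  ];
}

// What an agent would see: declarations and state, with no handlers attached.
function describeForAgent(page: TaskPage) {
  return {
    tools: page.tools.map(({ name, description, params }) => ({ name, description, params })),
    contexts: page.contexts.map((c) => ({ name: c.name, value: c.read() })),
  };
}

function readContext(page: TaskPage, name: string): string {
  const ctx = page.contexts.find((c) => c.name === name);
  if (!ctx) throw new Error(`Unknown context: ${name}`);
  return ctx.read();
}

// The agent may only call declared tools, with arguments of the declared types.
function invoke(page: TaskPage, name: string, args: Record<string, unknown>): void {
  const tool = page.tools.find((t) => t.name === name);
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  for (const [param, type] of Object.entries(tool.params)) {
    if (typeof args[param] !== type) {
      throw new Error(`Parameter "${param}" of ${name} must be a ${type}`);
    }
  }
  tool.handler(args);
}

const page = new TaskPage();
invoke(page, "add_task", { title: "Write report" });
invoke(page, "complete_task", { id: 1 });
console.log(JSON.stringify(describeForAgent(page), null, 2));
```

In this sketch, the agent can act only through what the page declares: an undeclared tool name or a wrongly typed argument is rejected before any page code runs, which is one way to read the "machine-readable contract" the abstract refers to.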