A Case for Declarative LLM-friendly Interfaces for Improved Efficiency of Computer-Use Agents

Yuan Wang, Mingyu Li, Haibo Chen

TL;DR

The paper addresses the inefficiency of LLM-powered computer-use agents interacting with GUIs by introducing GOI, a declarative interface that separates policy from mechanism via three primitives: Access, State, and Observation. GOI builds a path-unambiguous navigation topology and uses context-efficient descriptions to enable LLMs to declare desired outcomes while GOI deterministically handles navigation and interactions. Across Microsoft Office apps, GOI significantly improves task success rates (up to 1.67x) and reduces interaction steps (up to 43.5%), with many tasks completed in a single LLM call, demonstrating robust, scalable GUI automation without API access. The work highlights practical benefits, cross-OS considerations, and future directions for declarative interfaces tailored to LLM capabilities in real-world productivity software.

Abstract

Computer-use agents (CUAs) powered by large language models (LLMs) have emerged as a promising approach to automating computer tasks, yet they struggle with graphical user interfaces (GUIs). GUIs, designed for humans, force LLMs to decompose high-level goals into lengthy, error-prone sequences of fine-grained actions, resulting in low success rates and an excessive number of LLM calls. We propose Goal-Oriented Interface (GOI), a novel abstraction that transforms existing GUIs into three declarative primitives: access, state, and observation, which are better suited for LLMs. Our key idea is policy-mechanism separation: LLMs focus on high-level semantic planning (policy) while GOI handles low-level navigation and interaction (mechanism). GOI does not require modifying the application source code or relying on application programming interfaces (APIs). We evaluate GOI with Microsoft Office Suite (Word, PowerPoint, Excel) on Windows. Compared to a leading GUI-based agent baseline, GOI improves task success rates by 67% and reduces interaction steps by 43.5%. Notably, GOI completes over 61% of successful tasks with a single LLM call.
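The policy-mechanism separation described above can be illustrated with a minimal sketch. It is not GOI's implementation: the `Control` class, the `find_path` and `access` functions, and the toy ribbon layout are all hypothetical names chosen for illustration. The sketch assumes only what the abstract states, namely that the caller declares which control it wants while a deterministic layer works out the navigation steps.

```python
from dataclasses import dataclass, field


@dataclass
class Control:
    """One node in a toy navigation tree (hypothetical, not GOI's data model)."""
    name: str
    children: list["Control"] = field(default_factory=list)


def find_path(root, target):
    """Return the names on the path from root to target, or None if absent."""
    if root.name == target:
        return [root.name]
    for child in root.children:
        sub = find_path(child, target)
        if sub is not None:
            return [root.name] + sub
    return None


def access(root, target):
    """Declarative access: the caller names the goal, this layer derives the clicks."""
    path = find_path(root, target)
    if path is None:
        raise KeyError(target)
    return [f"click:{name}" for name in path]


# Toy UI layout, invented for this example.
ui = Control("Home", [
    Control("Font", [Control("Bold"), Control("Italic")]),
    Control("Paragraph", [Control("Align Center")]),
])

# Imperative style: the planner must enumerate every step itself.
imperative_plan = ["click:Home", "click:Font", "click:Bold"]

# Declarative style: the planner states only the target control.
declarative_plan = access(ui, "Bold")
```

In this toy setting both plans contain the same clicks, but only the imperative one requires the planner to know the intermediate menus. This is the sense in which the mechanism is moved out of the LLM's responsibility.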

Paper Structure

This paper contains 28 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of the GOI abstraction layer. GOI builds on ubiquitous GUI and OS accessibility features (e.g., Windows UIA, Linux AT-SPI2). A declarative interface specifies the intended state or outcome; an imperative one enumerates the actions that realize it.
  • Figure 2: Policy-Mechanism coupling in GUI use.
  • Figure 3: GOI Workflow: Offline Modeling and Online Execution. GOI has three declarative primitives: Access, State, and Observation.
  • Figure 4: Navigation Topology. To reach the bottom-right node via node 4, imperative GUI navigation operates on the graph and must spell out the explicit path [1, 4, 6, 9, 12]. With a declarative interface over a tree, only the target [19] is needed, but the number of nodes explodes. Over a forest, the required path is [7, 14].
  • Figure 5: Performance evaluation.
  • ...and 1 more figure
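
The trade-off in the Figure 4 caption, where a tree makes every target path-unambiguous at the cost of more nodes, can be sketched on a toy example. The graph below is invented for illustration and is not the one in Figure 4. The `unroll` function is a hypothetical helper, and the sketch does not show how GOI builds its forest.

```python
def unroll(graph, node, path=()):
    """Expand a DAG into a tree: one tree node per distinct root-to-node path."""
    here = path + (node,)
    yield here
    for child in graph.get(node, []):
        yield from unroll(graph, child, here)


# Toy graph (an assumption): node "D" is reachable through both "B" and "C".
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

tree_nodes = list(unroll(graph, "A"))

# In the graph, "E" is ambiguous: it can be reached along more than one path.
paths_to_E = [p for p in tree_nodes if p[-1] == "E"]

# In the tree, each numeric id stands for exactly one path, so naming an id suffices.
tree_ids = {i: p for i, p in enumerate(tree_nodes, start=1)}
```

Here the five graph nodes become seven tree nodes because the shared subtree under "D" is duplicated once per incoming path. That is a small-scale version of the node explosion the caption mentions.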