A11y-CUA Dataset: Characterizing the Accessibility Gap in Computer Use Agents

Ananya Gubbi Mohanbabu, Rosiana Natalie, Brandon Kim, Anhong Guo, Amy Pavel

TL;DR

A11y-CUA addresses the accessibility gap in Computer Use Agents (CUAs) with a multimodal dataset of blind and low-vision users (BLVUs) and sighted users (SUs) performing 60 real-world tasks on Windows. The authors provide a data-collection pipeline and an open-source recorder that capture synchronized screen video, audio, OS events, DOM/ARIA data, and accessibility settings, producing replayable traces for benchmarking. Analyses show distinct interaction styles between SUs and BLVUs, along with substantial variation within each group, and show that state-of-the-art CUAs operating under assistive technology (AT) constraints markedly underperform human users. The study also evaluates three AT-mimicking CUA configurations, identifies gaps in perception, cognition, and action, discusses limitations, and proposes directions toward accessibility-aware CUAs, including simulation, personas, collaborative assistants, and tutorials. Overall, A11y-CUA is a resource for benchmarking, simulating, and building collaborative, AT-aware CUAs that better support BLVUs and promote inclusive computer use.

Abstract

Computer Use Agents (CUAs) operate interfaces by pointing, clicking, and typing, mirroring the interactions of sighted users (SUs), who can thus monitor CUAs and share control. CUAs do not reflect the interactions of blind and low-vision users (BLVUs) who use assistive technology (AT), so BLVUs cannot easily collaborate with CUAs. To characterize the accessibility gap of CUAs, we present A11y-CUA, a dataset of BLVUs and SUs performing 60 everyday tasks, comprising 40.4 hours and 158,325 events. Our analysis of the collected interaction traces quantitatively confirms distinct interaction styles between the SU and BLVU groups (mouse- vs. keyboard-dominant) and demonstrates interaction diversity within each group (sequential vs. shortcut navigation for BLVUs). We then compare the collected traces to state-of-the-art CUAs under default and AT conditions (keyboard-only, magnifier). The default CUA completed 78.3% of tasks successfully. Under the AT conditions, the CUA's performance dropped to 41.67% and 28.3% for the keyboard-only and magnifier conditions, respectively, and did not reflect the nuances of real AT use. With our open A11y-CUA dataset, we aim to promote collaborative and accessible CUAs for everyone.

Paper Structure (47 sections, 10 figures, 6 tables)

This paper contains 47 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Illustrative example of interaction traces from SUs and BLVUs performing a task in the A11y-CUA dataset. SUs complete the task primarily through mouse interactions, resulting in fewer steps. In contrast, BLVUs use keyboard navigation and screen reader feedback, which generally leads to longer interaction sequences.
  • Figure 2: Computer Use Recorder. The Local Application Logger runs the pipeline: it presents tasks, resets the environment, records screen video and system audio, logs desktop actions, and receives web actions from a Chrome Extension through a Flask web logger. All streams are aligned on a single timeline to produce synchronized outputs: task metadata, OS input actions, screen video, system audio, OS window/element context, AT settings, UIA trees, and web DOM and accessibility tree events.
  • Figure 3: Task completion rates by task category for SUs, BLVUs, and three CUA configurations (Default-CUA, SR-CUA, Magnifier-CUA) with the Claude Sonnet 4.5 model. SUs complete nearly all tasks across categories, while BLVUs show slightly lower success rates, especially for workflow tasks. Default-CUA reaches moderate performance, approaching BLVUs for web & browsing, system operations, and media, but falls further behind on document editing and workflow tasks. SR-CUA and Magnifier-CUA perform substantially worse overall, with particularly low completion rates on workflow and media tasks.
  • Figure 4: Per-task completion times across 60 tasks. Each marker is one participant×task pair; orange symbols denote sighted users (SU1–SU8) and blue symbols denote blind and low-vision users (BLVU1–BLVU8). Tasks are grouped by category along the x-axis (vertical dividers). SUs complete most tasks quickly with a tighter spread (often <150 s), whereas BLVUs show a higher median and greater variance, especially in the Document Editing, Workflow, and Media categories. The wide dispersion within both groups shows substantial within-group strategy differences.
  • Figure 5: Cross-BLVU standard deviation in keystrokes by task and keystroke type (lower is more consistent). Each dot reports, for a given task, the standard deviation across BLVUs in one keystroke category: character input, navigation (Tab/Arrow), or hotkeys (Ctrl, Alt, Win, Shift). Variability is generally small for browsing & web and system operations, but rises sharply for document editing (largest spikes, driven by character and navigation counts) and remains elevated in workflow and media. Hotkey categories are typically low-variance, with occasional Ctrl spikes during editing. The high dispersion indicates diverse strategies (typing vs. navigation vs. shortcuts) for the same task.
  • ...and 5 more figures
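
The cross-participant variability described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes keystroke counts are available as (participant, task, category, count) records. The record layout, the function name `keystroke_spread`, and the sample values are all hypothetical and are not taken from the dataset. The caption does not say whether a population or sample standard deviation is used; this sketch assumes the population form.

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical records: (participant, task, keystroke category, count).
# The values are invented for illustration and do not come from A11y-CUA.
records = [
    ("BLVU1", "task_01", "navigation", 40),
    ("BLVU2", "task_01", "navigation", 10),
    ("BLVU3", "task_01", "navigation", 25),
    ("BLVU1", "task_01", "character", 12),
    ("BLVU2", "task_01", "character", 12),
    ("BLVU3", "task_01", "character", 12),
]


def keystroke_spread(rows):
    """Return {(task, category): std dev of counts across participants}."""
    grouped = defaultdict(list)
    for _participant, task, category, count in rows:
        grouped[(task, category)].append(count)
    # Population standard deviation (an assumption; see the note above).
    return {key: pstdev(counts) for key, counts in grouped.items()}


spread = keystroke_spread(records)
for (task, category), value in sorted(spread.items()):
    print(f"{task} {category}: {value:.2f}")
```

In this toy input, every participant types the same number of characters, so the character category has zero spread, while the differing navigation counts give a nonzero value. That is the pattern the caption reads as participants using different strategies for the same task.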