Towards Scalable Web Accessibility Audit with MLLMs as Copilots
Ming Gu, Ziwei Wang, Sicen Lai, Zirui Gao, Sheng Zhou, Jiajun Bu
TL;DR
The paper tackles the scalability bottleneck of web accessibility audits by proposing AAA, a full-lifecycle framework that operationalizes WCAG-EM through Automation, AI, and Auditor. It introduces GRASP, a graph-based multimodal sampling method that leverages textual, visual, and relational page representations to produce representative site subsets, and MaC, a multimodal LLM-powered copilot that assists across sampling and manual evaluation tasks. Four benchmark datasets (TPS, APR, CCT, CPE) are released to evaluate the web accessibility audit (WAA) pipeline and its ML-assisted components. Experiments show that small, fine-tuned MLLMs can approach the capabilities of larger models and that GRASP with a heterophilic GNN (IGNN) yields superior sampling quality, indicating practical viability for scalable accessibility auditing. The work lays the groundwork for scalable, AI-assisted audits and could influence future standards and tooling by providing open benchmarks and end-to-end methodologies.
Abstract
Ensuring web accessibility is crucial for advancing social welfare, justice, and equality in digital spaces, yet the vast majority of website user interfaces remain non-compliant, due in part to the resource-intensive and unscalable nature of current auditing practices. While WCAG-EM offers a structured methodology for site-wide conformance evaluation, it demands substantial human effort and lacks practical support for execution at scale. In this work, we present an auditing framework, AAA, which operationalizes WCAG-EM through a human-AI partnership model. AAA is anchored by two key innovations: GRASP, a graph-based multimodal sampling method that ensures representative page coverage via learned embeddings of visual, textual, and relational cues; and MaC, a multimodal large language model-based copilot that supports auditors through cross-modal reasoning and intelligent assistance in high-effort tasks. Together, these components enable scalable, end-to-end web accessibility auditing, empowering human auditors with AI-enhanced assistance for real-world impact. We further contribute four novel datasets designed for benchmarking core stages of the audit pipeline. Extensive experiments demonstrate the effectiveness of our methods and show that small-scale language models, once fine-tuned, can serve as capable experts.
