WebAccessVL: Making an Accessible Web via Violation-Conditioned VLM

Amber Yijia Zheng, Jae Joong Lee, Bedrich Benes, Raymond A. Yeh

TL;DR

This work reframes web accessibility as image-conditioned program synthesis, where a violation-aware vision-language model edits HTML to satisfy WCAG2 while preserving the original rendering. It introduces WebAccessVL, a dataset of 2,500 webpages with manually corrected WCAG2-compliant HTML, and a violation-conditioned VLM that uses WCAG2 violation counts as guidance. Empirical results show substantial reductions in violations (down to about 0.44 per site on the test set) and preserved visual fidelity, outperforming commercial LLM APIs (Gemini, GPT-5) and many supervised fine-tuning (SFT) baselines, with human perceptual studies corroborating design preservation. By releasing code and data, it aims to catalyze research in automated, privacy-conscious web accessibility tooling that assists developers rather than replacing them.

Abstract

We present a vision-language model (VLM) that automatically edits website HTML to address Web Content Accessibility Guidelines 2 (WCAG2) violations. We formulate this as a supervised image-conditioned program synthesis task, where the model learns to correct HTML given the HTML and its rendering. We collected WebAccessVL, a new dataset with manually corrected accessibility violations, establishing paired training data. We then propose a violation-conditioned VLM that additionally conditions on the WCAG2 violation count to guide the correction process. Experiments demonstrate that our method effectively reduces the average number of violations from 5.34 to 0.44 per website, outperforming commercial LLM APIs (Gemini, GPT-5). A perceptual study confirms that our edited websites maintain the original visual appearance and content.
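
To put the reported averages in relative terms, the short sketch below computes the percentage reduction in violations from the two numbers given in the abstract (5.34 before, 0.44 after). The function name `percent_improvement` is hypothetical, and treating "percentage improvement" as the reduction relative to the raw HTML baseline is an assumption about the metric, not a definition confirmed by the paper.

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative reduction in violations, as a percentage of the baseline.

    Assumed definition: 100 * (before - after) / before.
    """
    if before <= 0:
        raise ValueError("baseline violation count must be positive")
    return 100.0 * (before - after) / before


# Average violations per website reported in the abstract,
# before and after the model edits the HTML.
reduction = percent_improvement(5.34, 0.44)
print(f"{reduction:.1f}% fewer violations per website")  # prints "91.8% fewer violations per website"
```

Under this assumed definition, the drop from 5.34 to 0.44 corresponds to roughly a 92% reduction in average violations per website.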

Paper Structure (19 sections, 12 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Given an input HTML with accessibility violations (visualized by boxes in red), e.g., poor contrast, our method makes the HTML WCAG2 compliant by refining it to have improved contrast, a better layout, and an appropriate alt-text.
  • Figure 2: Overview of our image-conditioned program (HTML) synthesis pipeline. Our model takes in an HTML ${\bm{x}}$, the rendered website image ${\bm{I}}$, and the violation condition ${\bm{c}}$. The Vision-Language Model processes the input tokens and vision tokens extracted from ${\bm{I}}$ to generate output tokens autoregressively. Our negative guidance sampling refines the token logits to reduce the number of violations in the output, ensuring better compliance with WCAG 2.2 accessibility guidelines.
  • Figure 3: The percentage improvement in violation for open API/weights models compared to the raw HTML of the test set.
  • Figure 4: The percentage improvement in violation for SFT models compared to the raw HTML of the test set.
  • Figure 5: The percentage of vision/language violations out of all the violations from each LLM and VLM.
  • ...and 7 more figures