Human or LLM? A Comparative Study on Accessible Code Generation Capability

Hyunjae Suh, Mahan Tafreshipour, Sam Malek, Iftekhar Ahmed

TL;DR

The paper investigates whether LLMs can generate accessible web UI code and how that code compares with human-authored code. Using ten real-world projects, the authors regenerate UI components with two LLMs through a two-stage process (summaries first, then generation) and evaluate accessibility with AChecker and QualWeb. Advanced prompting offers limited gains and can introduce ARIA issues, which motivates FeedA11y, a ReAct-based feedback loop that uses accessibility evaluation results to iteratively fix code. The results show that LLMs often outperform humans on basic accessibility tasks and that FeedA11y yields the strongest gains, highlighting the value of evaluation-driven generation for accessible web content.

Abstract

Web accessibility is essential for inclusive digital experiences, yet the accessibility of LLM-generated code remains underexplored. This paper presents an empirical study comparing the accessibility of web code generated by GPT-4o and Qwen2.5-Coder-32B-Instruct-AWQ against human-written code. Results show that LLMs often produce more accessible code, especially for basic features like color contrast and alternative text, but struggle with complex issues such as ARIA attributes. We also assess advanced prompting strategies (Zero-Shot, Few-Shot, Self-Criticism), finding they offer some gains but are limited. To address these gaps, we introduce FeedA11y, a feedback-driven ReAct-based approach that significantly outperforms other methods in improving accessibility. Our work highlights the promise of LLMs for accessible code generation and emphasizes the need for feedback-based techniques to address persistent challenges.
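
The evaluation-driven idea behind a feedback loop of this kind can be illustrated with a minimal sketch. This is not the authors' FeedA11y implementation and it is not a ReAct agent: every name below (`evaluate`, `toy_revise`, `feedback_loop`) is hypothetical, the checker is a toy that only flags `<img>` tags without an `alt` attribute (standing in for a real evaluator such as AChecker or QualWeb), and the revision step is a regex rewrite standing in for an LLM call.

```python
import re
from html.parser import HTMLParser
from typing import Callable, List, Tuple


class _ImgAltChecker(HTMLParser):
    """Toy stand-in for a real evaluator: flags <img> tags with no alt attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.violations: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased by the parser.
        if tag == "img" and "alt" not in dict(attrs):
            self.violations.append("img element is missing an alt attribute")


def evaluate(html: str) -> List[str]:
    """Return a list of violation messages for the given markup (hypothetical API)."""
    checker = _ImgAltChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


def toy_revise(html: str, violations: List[str]) -> str:
    """Assumed stand-in for an LLM that receives the code plus evaluator feedback.

    It adds a placeholder alt attribute to every <img> that lacks one.
    """
    return re.sub(r"<img(?![^>]*\balt=)", '<img alt="placeholder"', html)


def feedback_loop(
    html: str,
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Evaluate, feed violations back to the reviser, and repeat until clean or out of rounds."""
    for _ in range(max_rounds):
        violations = evaluate(html)
        if not violations:
            break
        html = revise(html, violations)
    # Report whatever is still failing so the caller can see unresolved issues.
    return html, evaluate(html)


if __name__ == "__main__":
    page = '<p><img src="a.png"><img src="b.png" alt="B"></p>'
    fixed, remaining = feedback_loop(page, toy_revise)
    print(fixed)
    print(remaining)
```

The round limit matters in a loop like this: a reviser that cannot resolve a violation would otherwise iterate forever, so the sketch returns the remaining violations instead of assuming success.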

Paper Structure

This paper contains 24 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Methodology Overview
  • Figure 2: Progressive Enhancement of Prompts with Accessibility Instructions
  • Figure 3: FeedA11y Overview