ScreenAudit: Detecting Screen Reader Accessibility Errors in Mobile Apps Using Large Language Models

Mingyuan Zhong, Ruolin Chen, Xia Chen, James Fogarty, Jacob O. Wobbrock

TL;DR

This work introduces ScreenAudit, an LLM-powered accessibility auditor that traverses Android app screens using a TalkBack-enabled recorder, captures transcripts and UI data, and uses GPT-4o to identify screen reader accessibility errors that traditional rule-based checkers miss. In expert evaluations across 14 screens, ScreenAudit achieved 69.2% coverage with 71.3% precision, significantly outperforming baseline tools. The study also analyzes prompting strategies, showing that general accessibility guidance plus contextual prompts yields the best recall, and demonstrates that ScreenAudit can complement, not replace, existing checkers. The findings suggest a practical path toward faster, more expressive accessibility feedback and motivate future work on broader UI element coverage, richer context, simulated user testing, and potential code-level integration.

Abstract

Many mobile apps are inaccessible, thereby excluding people from their potential benefits. Existing rule-based accessibility checkers aim to mitigate these failures by identifying errors early during development but are constrained in the types of errors they can detect. We present ScreenAudit, an LLM-powered system designed to traverse mobile app screens, extract metadata and transcripts, and identify screen reader accessibility errors overlooked by existing checkers. We recruited six accessibility experts including one screen reader user to evaluate ScreenAudit's reports across 14 unique app screens. Our findings indicate that ScreenAudit achieves an average coverage of 69.2%, compared to only 31.3% with a widely-used accessibility checker. Expert feedback indicated that ScreenAudit delivered higher-quality feedback and addressed more aspects of screen reader accessibility compared to existing checkers, and that ScreenAudit would benefit app developers in real-world settings.

Paper Structure

This paper contains 63 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: ScreenAudit captures accessibility metadata from an app screen, including TalkBack transcripts. An LLM is used to evaluate potential screen reader accessibility errors, which are presented in a report. In the example from Amazon Music, ScreenAudit identifies the errors of using internal identifiers and redundant labels and provides actionable advice.
  • Figure 2: The Report Viewer of ScreenAudit displaying accessibility errors detected in a mobile application. The report entries for item 1 (unlabeled add button for "From" station) and item 12 (a redundant label) are expanded with their details shown. The "Issue Filter" menu (top) allows filtering of errors by type.
  • Figure 3: Accessibility errors identified by ScreenAudit and Accessibility Scanner. Axe did not identify any of the errors listed. Abbreviations: LQ = Label Quality. SG = Structure & Grouping. Head = Heading. Func = Functionality.
  • Figure 4: Stacked bar plot for each prompt or tool's performance by error category, with expert labels on the left as reference. Counts for ML, LQ, SG, Head, Func represent the number of true positives. Count for NE represents the number of true negatives. Count for X represents the number of classification errors.