Inferring Alt-text For UI Icons With Large Language Models During App Development

Sabrina Haque, Christoph Csallner

TL;DR

Icon accessibility in mobile apps is often hampered by missing or uninformative alt-text. The authors introduce IconDesc, a pipeline that combines an icon-only label from a large multi-modal model, in-icon OCR text, and surrounding DOM-tree context to generate informative alt-text from partial UI data, so that alt-text can be produced while an app is still under development. They fine-tune a small LLM on ~1.4k icons and evaluate it against deep learning (DL) baselines and state-of-the-art VLMs using WC20-derived ground truth, multiple metrics (e.g., CIDEr, SPICE), and a user study, showing substantial improvements, especially for partial-screen contexts. The approach is practical, low-data, and context-aware, and it can be integrated into development tools to accelerate accessibility improvements during iterative UI design.

Abstract

Ensuring accessibility in mobile applications remains a significant challenge, particularly for visually impaired users who rely on screen readers. User interface icons are essential for navigation and interaction, yet they often lack meaningful alt-text, creating barriers to effective use. Traditional deep learning approaches for generating alt-text require extensive datasets and struggle with the diversity and imbalance of icon types. More recent Vision Language Models (VLMs) require complete UI screens, which can be impractical during the iterative phases of app development. To address these issues, we introduce a novel method using Large Language Models (LLMs) to autonomously generate informative alt-text for mobile UI icons with partial UI data. By incorporating icon context that includes class, resource ID, bounds, OCR-detected text, and contextual information from parent and sibling nodes, we fine-tune an off-the-shelf LLM on a small dataset of approximately 1.4k icons, yielding IconDesc. In an empirical evaluation and a user study, IconDesc demonstrates significant improvements in generating relevant alt-text. This ability makes IconDesc an invaluable tool for developers, aiding in the rapid iteration and enhancement of UI accessibility.
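
The icon context listed in the abstract (class, resource ID, bounds, OCR-detected text, and parent and sibling information) can be pictured as a small record that is flattened into a text prompt for the LLM. The following is only a minimal sketch under stated assumptions: the `UINode` type, the `icon_context` and `build_prompt` helpers, the field names, and the prompt wording are all hypothetical illustrations and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class UINode:
    """Hypothetical, simplified view-hierarchy node (not the paper's schema)."""
    cls: str
    resource_id: str = ""
    bounds: Tuple[int, int, int, int] = (0, 0, 0, 0)  # left, top, right, bottom
    text: str = ""
    children: List["UINode"] = field(default_factory=list)
    # Excluded from repr/compare to avoid infinite recursion through the parent link.
    parent: Optional["UINode"] = field(default=None, repr=False, compare=False)

    def add(self, child: "UINode") -> "UINode":
        """Attach a child node and record its parent link."""
        child.parent = self
        self.children.append(child)
        return child


def _summary(node: UINode) -> dict:
    """Keep only the textual properties of a neighbouring node."""
    return {"class": node.cls, "resource_id": node.resource_id, "text": node.text}


def icon_context(icon: UINode, icon_label: str, ocr_text: str) -> dict:
    """Collect the icon's own properties plus parent and sibling context.

    `icon_label` and `ocr_text` are assumed to come from separate
    (unspecified here) image-labelling and OCR steps.
    """
    parent = icon.parent
    siblings = [n for n in parent.children if n is not icon] if parent else []
    return {
        "class": icon.cls,
        "resource_id": icon.resource_id,
        "bounds": list(icon.bounds),
        "icon_label": icon_label,
        "ocr_text": ocr_text,
        "parent": _summary(parent) if parent else None,
        "siblings": [_summary(s) for s in siblings],
    }


def build_prompt(ctx: dict) -> str:
    """Flatten the context into a plain-text prompt (illustrative wording)."""
    lines = ["Describe the function of this mobile UI icon as short alt-text."]
    lines += [f"{key}: {value}" for key, value in ctx.items()]
    return "\n".join(lines)


# Usage: a two-element toolbar, i.e. a partial screen rather than a full UI.
toolbar = UINode(cls="android.widget.LinearLayout", resource_id="music_bar")
icon = toolbar.add(UINode(cls="android.widget.ImageButton",
                          resource_id="btn_music", bounds=(0, 0, 96, 96)))
toolbar.add(UINode(cls="android.widget.TextView", text="Music"))
print(build_prompt(icon_context(icon, icon_label="musical note", ocr_text="")))
```

Because the sketch only needs the icon's parent and siblings, it works on a fragment of the view hierarchy, which mirrors the partial-UI setting the abstract emphasizes.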

Paper Structure

This paper contains 36 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Example zoom out (left) vs. lower volume (right) minus buttons in Rico.
  • Figure 2: Distribution of icon captions in LabelDroid and our WC20 subset. LabelDroid is more repetitive and less diverse.
  • Figure 3: Overview: Via the DOM tree IconDesc extracts text properties of an icon's surrounding elements (yellow-dashed box), maps the icon's pixels to an icon-only label, and extracts in-icon text. On this Strobe Light app screen (from Rico) baselines infer "select the <UNK>" (Coala) and "power button" (PaliGemma). IconDesc infers "turn on the music" (WC20 ground-truth labels: "toggle music" and "turn on the music").
  • Figure 4: In-icon text examples from Rico.
  • Figure 5: Input formats for Pix2Struct and PaliGemma.
  • ...and 2 more figures