Context-Aware Image Descriptions for Web Accessibility

Ananya Gubbi Mohanbabu, Amy Pavel

TL;DR

This work addresses an accessibility gap in the image descriptions that blind and low-vision users encounter on the web by introducing context-aware image descriptions. A Chrome Extension automatically extracts webpage context and uses GPT-4V to generate descriptions that reflect both the image and its surrounding content, with a pipeline that emphasizes visual grounding through visually concrete text. In a pipeline evaluation and a 12-participant user study, context-aware descriptions achieved accuracy and objectivity comparable to baselines but were rated higher in quality, imaginability, relevance, and plausibility, and participants expressed strong interest in future use. The study also identifies risks around trust, privacy, and potential over-reliance on automated details, and outlines future work including on-device processing, personalization, and broader evaluations to support real-world adoption and safety.

Abstract

Blind and low vision (BLV) internet users access images on the web via text descriptions. New vision-to-language models such as GPT-4V, Gemini, and LLaVA can now provide detailed image descriptions on demand. While prior research and guidelines state that BLV audiences' information preferences depend on the context of the image, existing tools for accessing vision-to-language models provide only context-free image descriptions: they generate descriptions for the image alone, without considering the surrounding webpage context. To explore how to integrate image context into image descriptions, we designed a Chrome Extension that automatically extracts webpage context to inform GPT-4V-generated image descriptions. We gathered feedback from 12 BLV participants in a user study comparing typical context-free image descriptions to context-aware image descriptions. We then further evaluated our context-informed image descriptions with a technical evaluation. Our user evaluation demonstrated that BLV participants frequently prefer context-aware descriptions to context-free descriptions. BLV participants also rated context-aware descriptions significantly higher in quality, imaginability, relevance, and plausibility. All participants shared that they wanted to use context-aware descriptions in the future and highlighted the potential for use in online shopping, social media, news, and personal interest blogs.

Paper Structure (62 sections, 10 figures, 4 tables)

This paper contains 62 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Webpage image context examples across news, travel, social media and shopping categories.
  • Figure 2: The system takes a webpage and an image on that page selected by the user, then provides context-aware descriptions based on both the webpage content and the selected image.
  • Figure 3: When a user clicks on an image in a website (right), our tool adds descriptions to the extension window (left).
  • Figure 4: In the pipeline evaluation, we evaluated the accuracy, objectivity, and relevance percentages of each description by coding whether each sentence in the description did not contain a hallucination (accurate), did not contain subjective details (objective), and did not contain irrelevant details (relevant). Bars represent the average % of accurate, objective, and relevant sentences for descriptions produced by the long and short version of each approach. The error bars are 95% confidence intervals.
  • Figure 5: The ratio of named entities (e.g., proper nouns) to words for the short and long version of each description across all categories. Low means few named entities and high means many named entities.
  • ...and 5 more figures
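
The sentence-level percentages described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' analysis code: the function names, the example codes, and the normal-approximation confidence interval are all assumptions made for illustration, and the caption does not say how the paper's intervals were computed.

```python
from math import sqrt
from statistics import mean, stdev


def percent_true(codes):
    """Percent of sentences coded True (e.g. 'accurate') in one description."""
    if not codes:
        raise ValueError("a description needs at least one coded sentence")
    return 100.0 * sum(codes) / len(codes)


def mean_with_ci95(values):
    """Mean and 95% CI half-width across descriptions.

    Assumption: a normal approximation (1.96 * standard error) is used here;
    the source does not state which interval method the paper used.
    """
    m = mean(values)
    if len(values) < 2:
        return m, 0.0
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return m, half_width


# Hypothetical codes: one list per description, one boolean per sentence
# (True means the sentence contained no hallucination).
accuracy_codes = [
    [True, True, False, True],
    [True, True, True],
    [True, False],
]

per_description = [percent_true(c) for c in accuracy_codes]
avg, half = mean_with_ci95(per_description)
print(f"accurate: {avg:.1f}% +/- {half:.1f}")
```

The same two functions would apply unchanged to objectivity and relevance codes, since each is a per-sentence yes/no judgment averaged per description and then across descriptions.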