Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users

Antonia Karamolegkou, Malvina Nikandrou, Georgios Pantazopoulos, Danae Sanchez Villegas, Phillip Rust, Ruchira Dhar, Daniel Hershcovich, Anders Søgaard

TL;DR

This work evaluates Multimodal Large Language Models (MLLMs) as visual assistants for visually impaired users through a user-centered study and a five-task benchmark (Image Captioning, Image QA, Optical Braille Recognition, Video Object Recognition, and Video QA) run on twelve models. A user survey focused on blind and low-vision (BLV) people informs task design and highlights priorities, challenges, and trust concerns, particularly around inaccuracies and cultural/multilingual context. The evaluation reveals substantial weaknesses in culture-aware and multilingual captioning, Braille reading, and assistive-object recognition in video, as well as safety-related hallucinations on adversarial questions. The findings underscore the need for inclusive datasets, improved cultural and linguistic robustness, Braille-reading capabilities, and user-centered evaluation to drive trustworthy, real-world visual assistance for BLV users.

Abstract

This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.

Paper Structure

This paper contains 58 sections, 21 figures, and 14 tables.

Figures (21)

  • Figure 1: User survey results highlighting the 15 most important terms (measured by TF-IDF scores), representing key challenges for AI visual assistants. (*) includes tasks such as object, handwriting and face recognition; and image, scene, and video description.
  • Figure 2: Illustration of the five key areas of our framework. We focus on tasks pertinent to BLV people covering different aspects for captioning, transcribing, and answering questions about visual content.
  • Figure 3: Left: Average chrF++ on sentence-level Braille-to-Text transcription. Right: F1-Score on cross-script question answering where results are binned based on the length of the context paragraph.
  • Figure 4: Age, Gender, and Ethnicity demographics extracted from Prolific after filtering the data to remove the "revoked_consent" options.
  • Figure 5: Responses on the potential adoption of AI models as visual assistants.
  • ...and 16 more figures