How Multimodal Large Language Models Support Access to Visual Information: A Diary Study With Blind and Low Vision People

Ricardo E. Gonzalez Penuela; Crescentia Jung; Sharon Y Lin; Ruiying Hu; Shiri Azenkot

TL;DR

The findings show that while MLLMs can improve visual interpretations' descriptive accuracy, supporting everyday use also depends on the "visual assistant" skill: behaviors for providing goal-directed, reliable assistance.

Abstract

Multimodal large language models (MLLMs) are changing how Blind and Low Vision (BLV) people access visual information. Unlike traditional visual interpretation tools that only provide descriptions, MLLM-enabled applications offer conversational assistance, where users can ask questions to obtain goal-relevant details. However, evidence about their performance in the real world and their implications for BLV people's daily lives remains limited. To address this, we conducted a two-week diary study in which we captured 20 BLV participants' use of an MLLM-enabled visual interpretation application. Although participants rated the visual interpretations of the application as "trustworthy" (mean=3.76 out of 5, max=extremely trustworthy) and "somewhat satisfying" (mean=4.13 out of 5, max=very satisfying), the AI often produced incorrect answers (22.2%) or abstained (10.8%) from responding to users' requests. Our findings show that while MLLMs can improve visual interpretations' descriptive accuracy, supporting everyday use also depends on the "visual assistant" skill: behaviors for providing goal-directed, reliable assistance. We conclude by proposing the "visual assistant" skill and guidelines to help MLLM-enabled visual interpretation applications better support BLV people's access to visual information.

Paper Structure (38 sections, 16 figures, 7 tables)

Introduction
Contributions, and Limitations
Related Work
Human-Powered Visual Interpretation Systems
AI-powered Visual Interpretation Systems
Method
Participants and Recruitment
Procedure
VisionPal: Our Data Collection Application
Data
Analyzing Diary Entries Context
Coding Context of Use: User Goal and Location
Categorizing User Questions
Analyzing MLLM Visual Interpretations Accuracy
Photo Description Accuracy
...and 23 more sections

Figures (16)

Figure 1: The interaction design of our application, VisionPal, is based on Seeing AI. In both applications, the user (1) opens the application to begin a visual assistance task, (2) takes a photo of their surroundings to receive an overview of key elements in the image, (3) chats with an MLLM to ask follow-up questions or clarify specific details, and (4) provides feedback on their experience to improve future interactions.
Figure 2: Demonstrative examples of our coding approach for initial description accuracy. Highlighted in red, the first photo description contains no hallucinations and accurately reads the seating section information on the overhead bin compartment; the second misidentifies the seat number (one hallucination); and the third misidentifies both the seat number and the color of the suitcase (two hallucinations).
Figure 3: Our coding approach to determine response correctness. We present an example question with example responses demonstrating varying levels of truthful and complete information.
Figure 4: An example of our coding approach for determining question answerability. For the first question "Is the cat asleep or awake?", there is visual evidence that the cat is awake so the question is labeled as "answerable". For the second question "What emotion is this cat feeling right now", there is no visual evidence of the emotions the cat is feeling so the question is labeled as "unanswerable".
Figure 5: Summary of initial description accuracy scores. Most descriptions were highly accurate, with 91.8% receiving the maximum score of 3 (no hallucinations).
...and 11 more figures
