Towards Understanding the Use of MLLM-Enabled Applications for Visual Interpretation by Blind and Low Vision People
Ricardo E. Gonzalez Penuela, Ruiying Hu, Sharon Lin, Tanisha Shende, Shiri Azenkot
TL;DR
This study investigates how blind and low-vision (BLV) users interact with multimodal large language model (MLLM)–enabled visual interpretation tools through a two-week diary study with VisionPal, a GPT-4o–based app. Preliminary analysis of 60 diary entries from six participants reveals high satisfaction (4.15/5) and moderate trust (3.75/5), with many users engaging in brief follow-up conversations and applying the tool to high-stakes tasks like medication guidance. While most questions are answered correctly, a non-trivial fraction receive incorrect answers or remain unresolved, underscoring the persistent risk of over-trusting incorrect outputs. The work highlights the potential of MLLMs to improve BLV access to visual information and outlines a plan to scale the analysis to the full 553 entries to derive design implications for robust, user-centered BLV visual interpretation systems.
Abstract
Blind and Low Vision (BLV) people have adopted AI-powered visual interpretation applications to address their daily needs. While these applications have been helpful, prior work has found that users remain dissatisfied with the applications' frequent errors. Recently, multimodal large language models (MLLMs) have been integrated into visual interpretation applications, and they show promise for more descriptive visual interpretations. However, it is still unknown how this advancement has changed people's use of these applications. To address this gap, we conducted a two-week diary study in which 20 BLV people used an MLLM-enabled visual interpretation application we developed, and we collected 553 entries. In this paper, we report a preliminary analysis of 60 diary entries from 6 participants. We found that participants considered the application's visual interpretations trustworthy (mean 3.75 out of 5) and satisfying (mean 4.15 out of 5). Moreover, participants trusted our application in high-stakes scenarios, such as receiving medical dosage advice. We discuss our plan to complete our analysis to inform the design of future MLLM-enabled visual interpretation systems.
