Beyond Visual Perception: Insights from Smartphone Interaction of Visually Impaired Users with Large Multimodal Models

Jingyi Xie; Rui Yu; He Zhang; Syed Masum Billah; Sooyeon Lee; John M. Carroll

TL;DR

The paper evaluates how large multimodal models (LMMs) used for visual description affect the daily activities of people with visual impairments (PVI), highlighting gains in context awareness and social understanding alongside critical limitations such as AI hallucinations and misinterpretation of identities. Through interviews with 14 users and an analysis of Be My AI-generated descriptions from participants and social media, the study finds that current systems improve spatial awareness but often fail to infer user intent or provide reliable identity cues. It proposes design strategies, including AI deferral learning, streamlined user-AI-RSA handoffs, and multi-agent collaboration, to reduce cognitive load and improve reliability. The findings underscore the need for real-time video processing and hybrid human-AI workflows to make LMM-based VQA tools more effective, interactive, and personalized for PVI.

Abstract

Large multimodal models (LMMs) have enabled new AI-powered applications that help people with visual impairments (PVI) receive natural language descriptions of their surroundings through audible text. We investigated how this emerging paradigm of visual assistance transforms how PVI perform and manage their daily tasks. Moving beyond usability assessments, we examined both the capabilities and limitations of LMM-based tools in personal and social contexts, while exploring design implications for their future development. Through interviews with 14 visually impaired users of Be My AI (an LMM-based application) and analysis of its image descriptions from both study participants and social media platforms, we identified two key limitations. First, these systems' context awareness suffers from hallucinations and misinterpretations of social contexts, styles, and human identities. Second, their intent-oriented capabilities often fail to grasp and act on users' intentions. Based on these findings, we propose design strategies for improving both human-AI and AI-AI interactions, contributing to the development of more effective, interactive, and personalized assistive technologies.
