Beyond Visual Perception: Insights from Smartphone Interaction of Visually Impaired Users with Large Multimodal Models

Jingyi Xie, Rui Yu, He Zhang, Syed Masum Billah, Sooyeon Lee, John M. Carroll

TL;DR

The paper evaluates how large multimodal models (LMMs) used for visual description affect the daily activities of visually impaired users, highlighting gains in context awareness and social understanding alongside critical limits such as AI hallucinations and misinterpretation of identities. Through 14 interviews and analysis of Be My AI-generated descriptions from participants and social media, the study reveals that current systems improve spatial awareness but often fail to infer user intent or provide reliable identity cues. It proposes design strategies including AI deferral learning, streamlined handoffs between the user, the AI, and remote sighted assistance (RSA), and multi-agent collaboration to reduce cognitive load and improve reliability. The findings underscore the need for real-time video processing and hybrid human-AI workflows to make LMM-based visual question answering (VQA) tools more effective, interactive, and personalized for people with visual impairments (PVI).

Abstract

Large multimodal models (LMMs) have enabled new AI-powered applications that help people with visual impairments (PVI) receive natural language descriptions of their surroundings through audible text. We investigated how this emerging paradigm of visual assistance transforms how PVI perform and manage their daily tasks. Moving beyond usability assessments, we examined both the capabilities and limitations of LMM-based tools in personal and social contexts, while exploring design implications for their future development. Through interviews with 14 visually impaired users of Be My AI (an LMM-based application) and analysis of its image descriptions from both study participants and social media platforms, we identified two key limitations. First, these systems' context awareness suffers from hallucinations and misinterpretations of social contexts, styles, and human identities. Second, their intent-oriented capabilities often fail to grasp and act on users' intentions. Based on these findings, we propose design strategies for improving both human-AI and AI-AI interactions, contributing to the development of more effective, interactive, and personalized assistive technologies.

Paper Structure

This paper contains 49 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Be My AI's description of a dog, including subjective interpretations of the dog's emotions. This screenshot was provided by P9.
  • Figure 2: On the left is the original image sent to Be My AI. On the right is Be My AI's description of eggs in a frying pan, followed by a question checking for the presence of eggshells. This example was originally drawn from X.
  • Figure 3: Be My AI's description of a conference room, with the original image cropped. This example was drawn from X.
  • Figure 4: The top shows the status quo of handoff between the user and Be My AI. The bottom illustrates our proposed simplified interaction.
  • Figure 5: Handoff between the user, Be My AI, and RSA for identity interpretations.
  • ...and 1 more figure