"Less is More": Reducing Cognitive Load and Task Drift in Real-Time Multimodal Assistive Agents for the Visually Impaired

Yi Zhao; Siqi Wang; Qiqun Geng; Erxin Yu; Jing Li

TL;DR

This work tackles high cognitive load and task drift in Vision–Language Model (VLM)–based assistive AI for the visually impaired. It introduces VIA-Agent, a brain–body system that co-optimizes a VIA-specialized VLM core for concise, goal-focused guidance with a low-latency Real-Time Communication (RTC) embodiment for fluid interaction. Through a formative study with 15 participants and a user evaluation with 9 participants against BeMyAI and Doubao, VIA-Agent demonstrates improved efficiency, reduced cognitive load and task drift, and higher usability, approaching or surpassing the strongest baseline in many conditions. The findings support a design framework prioritizing goal-persistence, calibrated conciseness, and an RTC-enabled embodied interface to deliver trustworthy, actionable assistance in real-world VIA scenarios. The work offers concrete design principles for implicit ability-awareness, situated guidance, and adaptive reasoning in VIA systems.

Abstract

Vision-Language Models (VLMs) enable on-demand visual assistance, yet current applications for people with visual impairments (PVI) impose high cognitive load and exhibit task drift, limiting real-world utility. We first conducted a formative study with 15 PVI and identified three requirements for visually impaired assistance (VIA): low latency for real-time use, minimal cognitive load, and hallucination-resistant responses to sustain trust. Informed by the formative study, we present VIA-Agent, a prototype that co-optimizes its cognitive 'brain' and interactive 'body'. The brain implements a goal-persistent design with calibrated conciseness to produce brief, actionable guidance; the body adopts a real-time communication (RTC) embodiment, evolving from a request-response Model Context Protocol (MCP) pipeline, to support fluid interaction. We evaluated VIA-Agent with 9 PVI across navigation and object retrieval in the wild against BeMyAI and Doubao. VIA-Agent significantly outperformed BeMyAI both quantitatively and qualitatively. While achieving success rates comparable to Doubao, it reduced mean task time by 39.9% (70.1 s vs. 110.7 s), required fewer conversational turns (4.3 vs. 5.0), and lowered perceived cognitive load and task drift. System Usability Scale (SUS) results aligned with these findings, with VIA-Agent achieving the highest usability. We hope this work inspires the development of more human-centered VIA systems.
