"Less is More": Reducing Cognitive Load and Task Drift in Real-Time Multimodal Assistive Agents for the Visually Impaired

Yi Zhao, Siqi Wang, Qiqun Geng, Erxin Yu, Jing Li

TL;DR

This work tackles high cognitive load and task drift in Vision–Language Model (VLM)–based assistive AI for the visually impaired. It introduces VIA-Agent, a brain–body system that co-optimizes a VIA-specialized VLM core for concise, goal-focused guidance with a low-latency Real-Time Communication (RTC) embodiment for fluid interaction. Through a formative study with 15 participants and a user evaluation with 9 participants against BeMyAI and Doubao, VIA-Agent demonstrates improved efficiency, reduced cognitive load and task drift, and higher usability, approaching or surpassing the strongest baseline in many conditions. The findings support a design framework prioritizing goal-persistence, calibrated conciseness, and an RTC-enabled embodied interface to deliver trustworthy, actionable assistance in real-world VIA scenarios. The work offers concrete design principles for implicit ability-awareness, situated guidance, and adaptive reasoning in VIA systems.

Abstract

Vision-Language Models (VLMs) enable on-demand visual assistance, yet current applications for people with visual impairments (PVI) impose high cognitive load and exhibit task drift, limiting real-world utility. We first conducted a formative study with 15 PVI and identified three requirements for visually impaired assistance (VIA): low latency for real-time use, minimal cognitive load, and hallucination-resistant responses to sustain trust. Informed by the formative study, we present VIA-Agent, a prototype that co-optimizes its cognitive 'brain' and interactive 'body'. The brain implements a goal-persistent design with calibrated conciseness to produce brief, actionable guidance; the body adopts a real-time communication (RTC) embodiment, evolving from a request-response Model Context Protocol (MCP) pipeline, to support fluid interaction. We evaluated VIA-Agent with 9 PVI across navigation and object retrieval in the wild against BeMyAI and Doubao. VIA-Agent significantly outperformed BeMyAI both quantitatively and qualitatively. While achieving success rates comparable to Doubao, it reduced mean task time by 39.9% (70.1 s vs. 110.7 s), required fewer conversational turns (4.3 vs. 5.0), and lowered perceived cognitive load and task drift. System Usability Scale (SUS) results aligned with these findings, with VIA-Agent achieving the highest usability. We hope this work inspires the development of more human-centered VIA systems.

Paper Structure

This paper contains 52 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Core design principle and comparative positioning of VIA-Agent. (a) VIA-Agent co-optimizes the 'Brain' (a VIA-specialized VLM) and the 'Body' (a real-time interaction embodiment) to deliver concise, actionable guidance for people with visual impairments. (b) We iterated from a wearable, request-response-based form factor to a mobile live-chat app, culminating in VIA-Agent, which provides effective and seamless assistance. This positioning distinguishes VIA-Agent from existing solutions such as the inefficient request-response BeMyAI [bemyai_web_2025] and the general-purpose Doubao [doubao_web_2025].
  • Figure 2: The VIA-Agent Core's architecture and iterative refinement. The VIA-Agent Core (left) specifies the agent's cognitive model: its guiding principles, a five-step reasoning workflow, and task-specific demonstrations for in-context learning. This model is then optimized through an iterative refinement loop (right), where user feedback from task execution is systematically evaluated to update the agent's policy, progressively enhancing its effectiveness.
  • Figure 3: The architectural evolution of the VIA-Agent embodiment. The initial MCP-based implementation (Left) operates on a discrete, request-response workflow using a dedicated frontier device. The iterated RTC-based implementation (Right) transitions to a mobile app, enabling continuous video and audio streaming for low-latency, real-time interaction.
  • Figure 4: Design parameters and procedural logic of the VIA-Agent Core. The figure details the agent's static configuration, specifying the base VLM, input/output constraints (e.g., a 128-token response limit, two-round context window), and memory settings. It also outlines the dynamic operational logic for critical steps within the Thinking Workflow, namely the goal re-evaluation process (Step 1) and the multi-level confidence filtering schema (Step 4).
  • Figure 5: Architectural Evolution of the VIA-Agent Embodiment. The system progressed from an Initial Prototype: Wearable MCP Device (Left), which used embedded hardware components (ESP32-S3, OV3660 camera) for a discrete request-response workflow. The final design, the RTC Mobile App (Right), overcomes latency issues by leveraging continuous Real-Time Communication (RTC) streaming, achieving low latency and seamless usability in human-AI interaction.
  • ...and 5 more figures
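
The Figure 4 caption names two static settings of the VIA-Agent Core: a 128-token response limit and a two-round context window. The sketch below is one way such settings might be held together with a persistent goal. It is illustrative only and is not the authors' implementation: the names `CoreConfig` and `DialogueContext`, and the choice to store the goal outside the rolling window so that it survives eviction of old rounds, are assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    # Both values come from the Figure 4 caption; the field names are hypothetical.
    max_response_tokens: int = 128
    context_rounds: int = 2


class DialogueContext:
    """Keeps the user's goal plus only the most recent dialogue rounds."""

    def __init__(self, goal: str, config: CoreConfig = CoreConfig()):
        # Assumption: the goal is stored separately from the rolling history,
        # so it is never dropped when older rounds are evicted.
        self.goal = goal
        self.config = config
        self._rounds = deque(maxlen=config.context_rounds)

    def add_round(self, user_utterance: str, agent_reply: str) -> None:
        # A deque with maxlen discards the oldest round automatically.
        self._rounds.append((user_utterance, agent_reply))

    def history(self) -> list:
        # Rounds that would accompany the goal in the next model request.
        return list(self._rounds)


ctx = DialogueContext(goal="find the elevator")
ctx.add_round("Where am I?", "In a lobby. Doors ahead.")
ctx.add_round("Which way?", "Walk forward five steps.")
ctx.add_round("Now?", "Elevator is on your right.")
```

After the third round, `ctx.history()` holds only the last two rounds, while `ctx.goal` is unchanged.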