Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance

Xiangxiang Wang; Xuanyu Wang; YiJia Luo; Yongbin Yu; Manping Fan; Jingtao Zhang; Liyong Ren

TL;DR

The paper addresses the impractical memory footprint and latency of large vision-language models (VLMs) in assistance for visually impaired users by introducing a cross-modal differentiated quantization (CMDQ) framework and a scene-aware vectorized memory multi-agent system. CMDQ tailors quantization to the visual encoder and the cross-modal modules, using modular calibration and bit-packed dequantization to reduce memory from 38 GB to 11.3 GB while preserving multimodal performance. The multi-agent system combines a flow-based architecture, vectorized memory via retrieval-augmented generation (RAG), and streaming speech to describe the environment beyond the current view, with roughly 3 seconds of latency to initial speech output. Experiments on MMBench and OCR-VQA show only minor accuracy losses (≈2% on MMBench, ≈1–2% on OCR-VQA) alongside the memory reduction, and practical evaluations demonstrate robust scene classification, memory retrieval, and low-latency interaction. The integrated approach enables near-real-time, scene-aware assistance on consumer hardware, advances practical deployment of VLMs for visually impaired users, and points toward mixed-precision quantization and expanded real-world testing as next steps.
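To make the memory figures and the idea of bit-packed dequantization concrete, here is a minimal sketch in plain Python. It is not the paper's CMDQ implementation. It assumes symmetric per-tensor 4-bit quantization with two codes packed per byte, and all function names are hypothetical. The arithmetic helper shows that 19B parameters at 16 bits per weight come to 38 GB, which matches the reported baseline. That the baseline is 16-bit is an assumption. Uniform 4-bit storage would need about 9.5 GB, so one plausible reading, not confirmed by the text, is that the reported 11.3 GB reflects some modules kept at higher precision plus quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes (ignores activations and metadata)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7].

    Assumption: one shared scale for the whole tensor; real schemes
    typically use per-group or per-channel scales.
    """
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 7.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def pack_int4(codes):
    """Pack two 4-bit codes into each byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover signed 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize_int4(packed, scale, count):
    """Unpack the codes and rescale them to approximate float weights."""
    return [code * scale for code in unpack_int4(packed, count)]
```

For example, five weights quantize to five codes that fit in three bytes, and `dequantize_int4` recovers each weight to within half a quantization step.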

Abstract

Visually impaired individuals face significant challenges in environmental perception. Traditional assistive technologies often lack adaptive intelligence, focusing on individual components rather than integrated systems. While Vision-Language Models (VLMs) offer a promising path to richer, integrated understanding, their deployment is severely limited by substantial computational requirements, demanding dozens of gigabytes of memory. To address these gaps in computational efficiency and integrated design, this study proposes two complementary innovations: a cross-modal differentiated quantization framework for VLMs and a scene-aware vectorized memory multi-agent system. The quantization framework applies differentiated strategies to different model components, reducing memory from 38 GB to 11.3 GB. The multi-agent system uses vectorized memory and perception-memory-reasoning workflows to provide environmental information beyond the current view, achieving 2.83–3.52 s latency to initial speech output. Experiments show that the quantized 19B-parameter model loses only 2.05% on MMBench and scores 63.7 on OCR-VQA (original: 64.9), outperforming smaller models with equivalent memory requirements. This research advances computational efficiency and assistive technology, offering comprehensive assistance in scene perception, text recognition, and navigation.
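The vectorized memory idea, recalling stored scene descriptions that are relevant to a query even when they are outside the current view, can be illustrated with a small similarity search. This is a minimal sketch, not the system described in the paper. The `SceneMemory` class, the toy three-dimensional embeddings, and the use of cosine similarity are all assumptions made for illustration. A real system would obtain embeddings from an encoder model and would likely use a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SceneMemory:
    """Hypothetical in-memory store of (embedding, description) pairs."""

    def __init__(self):
        self._entries = []

    def add(self, embedding, description):
        """Store one past observation with its embedding."""
        self._entries.append((list(embedding), description))

    def retrieve(self, query_embedding, top_k=2):
        """Return the top_k (score, description) pairs, best match first."""
        scored = [
            (cosine_similarity(query_embedding, embedding), description)
            for embedding, description in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


# Toy usage: hand-made embeddings stand in for real encoder outputs.
memory = SceneMemory()
memory.add([1.0, 0.0, 0.0], "kitchen: kettle on the counter")
memory.add([0.0, 1.0, 0.0], "hallway: stairs ahead")
memory.add([0.9, 0.1, 0.0], "kitchen: fridge to the left")
matches = memory.retrieve([1.0, 0.0, 0.0], top_k=2)
```

In a retrieval-augmented setup, the retrieved descriptions would be added to the model's prompt so that its answer can draw on earlier observations as well as the current frame.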
