Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance

Xiangxiang Wang, Xuanyu Wang, YiJia Luo, Yongbin Yu, Manping Fan, Jingtao Zhang, Liyong Ren

TL;DR

The paper addresses the impractical memory footprint and latency of large vision-language models (VLMs) for visually impaired assistance by introducing a cross-modal differentiated quantization (CMDQ) framework and a scene-aware vectorized memory multi-agent system. CMDQ quantizes the visual encoder and the cross-modal modules separately, using modular calibration and bit-packed dequantization, and reduces memory from 38 GB to 11.3 GB while largely preserving multimodal performance. The multi-agent system combines a flow-based architecture, vectorized memory via retrieval-augmented generation (RAG), and streaming speech to describe the environment beyond the current view, with roughly 3 seconds of latency to the first speech output. Experiments show only minor accuracy losses (≈2% on MMBench; 64.9 to 63.7 on OCR-VQA) alongside the memory reduction, and practical evaluations demonstrate robust scene classification, memory retrieval, and low-latency interaction. The integrated approach enables scene-aware assistance on consumer hardware, advancing practical deployment of VLMs for visually impaired users and suggesting paths toward mixed-precision quantization and expanded real-world testing.

Abstract

Visually impaired individuals face significant challenges in environmental perception. Traditional assistive technologies often lack adaptive intelligence, focusing on individual components rather than integrated systems. While Vision-Language Models (VLMs) offer a promising path to richer, integrated understanding, their deployment is severely limited by substantial computational requirements, demanding dozens of gigabytes of memory. To address these gaps in computational efficiency and integrated design, this study proposes a dual technological innovation framework: a cross-modal differentiated quantization framework for VLMs and a scene-aware vectorized memory multi-agent system. The quantization framework implements differentiated strategies, reducing memory from 38 GB to 11.3 GB. The multi-agent system uses vectorized memory and perception-memory-reasoning workflows to provide environmental information beyond the current view, achieving 2.83–3.52 s latency to initial speech output. Experiments show that the quantized 19B-parameter model experiences only a 2.05% performance drop on MMBench and maintains 63.7 accuracy on OCR-VQA (original: 64.9), outperforming smaller models with equivalent memory. This research advances computational efficiency and assistive technology, offering comprehensive assistance in scene perception, text recognition, and navigation.
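The memory figures in the abstract can be sanity-checked with simple arithmetic. The sketch below assumes that the 38 GB baseline corresponds to 16-bit weights and that 1 GB means 10^9 bytes; neither assumption is stated in the abstract, although 19B parameters at 2 bytes each does give exactly 38 GB. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight storage in GB (10^9 bytes) for a given bit width per parameter."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 19e9        # 19B-parameter model, from the abstract
REPORTED_GB = 11.3   # quantized footprint reported in the abstract

fp16_gb = weight_memory_gb(PARAMS, 16)       # assumed 16-bit baseline
int4_gb = weight_memory_gb(PARAMS, 4)        # hypothetical uniform 4-bit lower bound
effective_bits = REPORTED_GB * 1e9 * 8 / PARAMS  # average bits per parameter implied by 11.3 GB
reduction = 1 - REPORTED_GB / fp16_gb        # fraction of memory saved

print(f"16-bit baseline:      {fp16_gb:.1f} GB")
print(f"uniform 4-bit bound:  {int4_gb:.1f} GB")
print(f"implied bits/param:   {effective_bits:.2f}")
print(f"memory reduction:     {reduction:.1%}")
```

Under these assumptions the reported 11.3 GB sits above a uniform 4-bit bound, which would be consistent with some storage going to quantization metadata or to modules kept at higher precision. The abstract does not break this down, so that reading is only a guess.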

Paper Structure

This paper contains 33 sections, 25 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Illustration of the modality-specific module partitioning strategy for quantization in VLMs. The figure depicts the process of collecting calibration data using VisionCatcher and MultimodalCatcher modules, followed by the quantization of vision encoder layers and cross-modal processing modules separately. The modular quantization approach ensures that each functional module undergoes independent quantization, reducing interference and maintaining model performance across different modalities.
  • Figure 2: Efficient dequantization computation and storage optimization architecture. The diagram illustrates the bit-packed storage format (bottom left) with multiple weights packed into 32-bit integers, the dequantization process (center) with unpacking, scaling, and zero-point adjustment operations, and the matrix multiplication workflow (right) that enables efficient computation. This optimized architecture reduces memory requirements while maintaining computational efficiency for large vision-language models.
  • Figure 3: Flow-based Multi-agent Visual Assistance Framework. The architecture integrates Perception (scene capture), Memory RAG (similarity-based scene retrieval), and Deliberation (analysis and interaction) modules, all powered by a VLM. The system optimizes processing by directly reusing analysis for high-similarity scenes and leveraging historical data for medium-similarity scenes, creating an efficient closed-loop system for visually impaired users.
  • Figure 4: Performance comparison of different VLMs on benchmarks. (a) MMBench v1.1 TEST results. (b) OCRVQA_TESTCORE results.
  • Figure 5: Example of environmental description scene processing workflow. The system successfully identified the environmental description type in the scene transition from corridor to conference room and extracted key layout information. The figure shows the complete perception-analysis-execution process, with detailed analysis results presented in Table \ref{tab:description_scene_details}.
  • ...and 4 more figures
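
The bit-packed storage and dequantization flow described in the Figure 2 caption (several weights packed into one 32-bit integer, then unpacking, zero-point adjustment, and scaling) can be illustrated with a minimal pure-Python sketch. The 4-bit width, the lowest-bits-first slot order, the per-tensor `scale` and `zero_point`, and all function names are assumptions made for illustration; the paper's actual layout and kernels are not given in this excerpt.

```python
from typing import List

BITS = 4                    # assumed bit width per weight (illustrative choice)
PER_WORD = 32 // BITS       # eight 4-bit codes fit in one 32-bit word
MASK = (1 << BITS) - 1      # 0b1111, selects one code


def pack(codes: List[int]) -> List[int]:
    """Pack unsigned BITS-bit codes into 32-bit words, lowest bits first."""
    words = []
    for start in range(0, len(codes), PER_WORD):
        word = 0
        for slot, code in enumerate(codes[start:start + PER_WORD]):
            if not 0 <= code <= MASK:
                raise ValueError(f"code {code} does not fit in {BITS} bits")
            word |= code << (slot * BITS)
        words.append(word)
    return words


def unpack(words: List[int], count: int) -> List[int]:
    """Recover the first `count` codes from the packed words."""
    codes = []
    for word in words:
        for slot in range(PER_WORD):
            codes.append((word >> (slot * BITS)) & MASK)
    return codes[:count]


def dequantize(words: List[int], count: int,
               scale: float, zero_point: int) -> List[float]:
    """Unpack, subtract the zero point, then scale back to real values."""
    return [scale * (code - zero_point) for code in unpack(words, count)]


def dot(words: List[int], count: int, scale: float,
        zero_point: int, x: List[float]) -> float:
    """One row of a matrix-vector product computed from packed weights."""
    weights = dequantize(words, count, scale, zero_point)
    return sum(w * xi for w, xi in zip(weights, x))


packed = pack([8, 10, 6])
print(dequantize(packed, 3, scale=0.5, zero_point=8))
print(dot(packed, 3, 0.5, 8, [1.0, 2.0, 3.0]))
```

A production implementation would typically run the unpack and scale steps inside a GPU kernel and use per-group rather than per-tensor scales; the sketch only shows the data layout and the order of operations.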