Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals

Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla, Arnav Bhavsar, Pawan Goyal

TL;DR

The paper tackles the accessibility gap for blind and low-vision (BLV) users by comparing lightweight SmolVLM2 variants (500M and 2.2B) on indoor Charades and outdoor AVCaps, and by introducing two BLV-specific evaluation frameworks. It combines four prompting strategies with professional audio-description (AD) guidelines to generate descriptions, which are evaluated against ground truth with offline, BLV-focused metrics using GPT-OSS-20B. Ground truth is produced with a 42-guideline AD protocol, and an adaptive sampling method that selects 3–4 keyframes supports efficient on-device processing. The study demonstrates that smaller models can match or exceed larger ones in certain BLV contexts, validates edge deployment on consumer hardware with FP32/INT8, and delivers practical pathways toward private, on-device, real-time BLV video accessibility. The work thus advances democratized BLV video accessibility by aligning model capabilities with on-device constraints and user-centric evaluation.

Abstract

Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor) and Charades (indoor). We introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework, which evaluates spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, which focuses on mobility-critical information. Additionally, we systematically evaluate four prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.

Paper Structure

This paper contains 13 sections, 2 equations, 1 figure, and 7 tables.

Figures (1)

  • Figure 1: Experimental Design Overview: Four prompting strategies evaluated across SmolVLM variants and a reference model (Qwen). The diagram illustrates progressive complexity, from a baseline prompt-only approach to a comprehensive prompt that integrates context and audio-description guidelines. Each strategy generates descriptions that are evaluated against ground truth using both standard NLP metrics and custom accessibility metrics designed for BLV users.