Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations

Ricardo Gonzalez Penuela, Felipe Arias-Russi, Victor Capriles

TL;DR

This paper examines how multimodal large language models (MLLMs) can provide proactive, contextually relevant visual descriptions for Blind and Low Vision (BLV) users by drawing on historical BLV questions. It introduces a retrieval-augmented approach that retrieves semantically similar past image-question pairs from VizWiz-LF to guide description generation, and it evaluates both context-aware and context-free conditions. Results indicate that context-aware descriptions are more accurate (76.1% vs 63.0%), anticipate user questions in a subset of cases (15.2%), and are preferred in the majority of comparisons (54.3%). The work demonstrates the value of historical user queries as a signal for improving BLV assistance and outlines pathways for scaling, weighting retrievals by similarity, and personalizing context for individual users.

Abstract

Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must sift through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually relevant information, we developed a system that draws on historical BLV users' questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who reviewed 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70 out of 92) and were preferred in 54.3% of comparisons (50 out of 92). Our paper, reviews, and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions
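
The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the embedding model, the similarity metric, the number of retrieved items, or the prompt wording, so everything below is an assumption: cosine similarity over precomputed vectors, a hypothetical `retrieve_questions` helper, a hypothetical `build_prompt` template, and made-up toy questions that are not taken from VizWiz-LF.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_questions(query_vec, history, k=3):
    """Return the questions attached to the k past contexts most similar to the query.

    `history` is a list of (embedding, question) pairs. In a real system the
    embeddings would come from an image or image-text encoder; that choice is
    an assumption here, not something the abstract states.
    """
    ranked = sorted(history, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [question for _, question in ranked[:k]]

def build_prompt(questions):
    """Assemble a hypothetical context-aware prompt from the retrieved questions."""
    lines = [
        "Describe this image for a blind or low vision user.",
        "People in similar visual contexts have asked:",
    ]
    lines += [f"- {q}" for q in questions]
    lines.append("Prioritize details that would answer questions like these.")
    return "\n".join(lines)

# Toy data: three past contexts with illustrative (invented) embeddings and questions.
history = [
    ([1.0, 0.0, 0.0], "What flavor is this soup?"),
    ([0.8, 0.6, 0.0], "What is the expiration date on this can?"),
    ([0.0, 1.0, 0.0], "What color is this shirt?"),
]
query = [1.0, 0.2, 0.0]  # stand-in embedding for the new image

questions = retrieve_questions(query, history, k=2)
prompt = build_prompt(questions)
print(prompt)
```

A context-free condition, under the same assumptions, would simply send the first line of the prompt without the retrieved questions.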

Paper Structure

This paper contains 15 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: In some cases, both the context-aware and context-free descriptions hallucinated information and therefore did not answer the user's question accurately.