Say It My Way: Exploring Control in Conversational Visual Question Answering with Blind Users

Farnaz Zamiri Zeraati, Yang Trista Cao, Yuehan Qiao, Hal Daumé, Hernisa Kacorri

TL;DR

This work analyzes prompting-based techniques participants adopted, including those introduced in the study and those developed independently in real-world settings, and offers insights for interaction design at both query and system levels.

Abstract

Prompting and steering techniques are well established in general-purpose generative AI, yet assistive visual question answering (VQA) tools for blind users still follow rigid interaction patterns with limited opportunities for customization. User control can be helpful when system responses are misaligned with users' goals and contexts, a gap that becomes especially consequential for blind users who may rely on these systems for access. We invite 11 blind users to customize their interactions with a real-world conversational VQA system. Drawing on 418 interactions, reflections, and post-study interviews, we analyze prompting-based techniques participants adopted, including those introduced in the study and those developed independently in real-world settings. VQA interactions were often lengthy: participants averaged 3 turns, sometimes up to 21, with input text typically tenfold shorter than the responses they heard. Built on state-of-the-art LLMs, the system lacked verbosity controls, was limited in estimating distance in space and time, relied on inaccessible image framing, and offered little to no camera guidance. We discuss how customization techniques such as prompt engineering can help participants work around these limitations. Alongside a new publicly available dataset, we offer insights for interaction design at both query and system levels.

Paper Structure

This paper contains 47 sections, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Our multi-day study includes an in-lab interview with scenarios, a 10-day diary, and a remote interview.
  • Figure 2: Lengths of user inputs and system responses (measured in word count) across in-lab session and diary session interactions.
  • Figure 3: Example interaction consisting of five conversational turns with word count for each turn.
  • Figure 4: Distribution of interactions across scenarios, with colors indicating presence, absence, or uncertainty and intensity marking customization. Proportionally, interactions involved more customization in familiar environments and with familiar items, slightly more around others than alone, a similar amount indoors and outdoors, and less when in a hurry.
  • Figure 5: All the customization techniques observed in the Ask&Prompt dataset, following the hierarchy from Schulhoff et al. (2025). Techniques with a solid border were introduced in the lab. Techniques marked with an asterisk were devised by participants in the diaries. Techniques with a dashed border are established customization techniques that resemble the strategies participants explored. Due to limitations of Be My AI, we use image-as-text prompting instead of few-shot prompting to provide information.
  • ...and 3 more figures