ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos

Maryam Cheema; Sina Elahimanesh; Pooyan Fazli; Hasti Seifi

Abstract

Advances in multimodal large language models enable automatic video narration and question answering (VQA), offering scalable alternatives to labor-intensive, human-authored audio descriptions (ADs) for blind and low vision (BLV) viewers. However, prior AI-driven AD systems rarely adapt to the diverse needs and preferences of BLV individuals across videos and are typically evaluated in controlled, single-session settings. We present ViDscribe, a web-based platform that integrates AI-generated ADs with six types of user customizations and a conversational VQA interface for YouTube videos. Through a longitudinal, in-the-wild study with eight BLV participants, we examine how users engage with customization and VQA features over time. Our results show sustained engagement with both features and that customized ADs improve effectiveness, enjoyment, and immersion compared to default ADs, highlighting the value of personalized, interactive video access for BLV users.

Paper Structure (14 sections, 3 figures, 1 table)

Introduction
Related Work
ViDscribe
User Study
Results
Discussion and Conclusion
Appendix: Prompt Templates Used in ViDscribe
Base Audio Description Prompt
General Guidelines
Customization Prompt: Description Subjectivity
Customization Prompt: Color Preference
Customization Prompt: Description Emphasis
Interactive Visual Question Answering Prompt
Codebook of VQA Types with Definitions and Examples

Figures (3)

Figure 1: ViDscribe interface, showing customization controls, video page with adaptive ADs, and interactive VQA during playback.
Figure 2: User ratings for (a) customized vs. default ADs in daily surveys, and (b) ViDscribe in the end-of-week survey.
Figure 3: Distribution of VQA types and response accuracy.
