ADCanvas: Accessible and Conversational Audio Description Authoring for Blind and Low Vision Creators

Franklin Mingzhe Li, Michael Xieyang Liu, Cynthia L. Bennett, Shaun K. Kane

TL;DR

ADCanvas reimagines audio description (AD) authoring for blind and low vision (BLV) creators by providing a screen-reader-friendly, non-visual workflow that combines a WebVTT editor, keyboard controls, and an instruction-based multimodal AI agent for live visual question answering (VQA) and drafting. In a study with 12 BLV creators, the system enabled independent AD authoring, with the agent serving as an information conduit, drafting assistant, and co-author; the study also revealed design needs around trust, verification, and fine-grained control. The work contributes empirical insights into human-AI co-creation in non-visual media, along with design implications for agent configurability, interaction modes, and accessibility-focused workflow integration. It highlights a path toward more autonomous, yet human-centered, AI-assisted AD tools that preserve professional standards and creative agency for BLV practitioners. As AI tools evolve, the work advocates for transparent, configurable, and co-creative interfaces that expand accessibility without undermining expert practice.

Abstract

Audio Description (AD) provides essential access to visual media for blind and low vision (BLV) audiences. Yet current AD production tools remain largely inaccessible to BLV video creators, who possess valuable expertise but face barriers due to visually driven interfaces. We present ADCanvas, a multimodal authoring system that supports non-visual control over AD creation. ADCanvas combines conversational interaction with keyboard-based playback control and a plain-text, screen reader-accessible editor to support end-to-end AD authoring and visual question answering (VQA). Combining screen-reader-friendly controls with a multimodal LLM agent, ADCanvas supports live VQA, script generation, and AD modification. Through a user study with 12 BLV video creators, we find that users adopt the conversational agent as an informational aide and drafting assistant, while maintaining agency through verification and editing. For example, participants saw themselves as curators who received information from the model and filtered it down for their audience. Our findings offer design implications for accessible media tools, including precise editing controls, accessibility support for creative ideation, and configurable rules for human-AI collaboration.

Paper Structure (50 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Conversation between Clara and the agent about the “PAREIDOLIA” video. a1) Clara asks the agent for a brief summary. a2) The agent responds with text noting that the man appears distressed. b1) Clara then asks which visuals show that the man looks distressed. b2) The agent describes where and why the man looks distressed.
  • Figure 2: Context-aware interaction between the AD editor, video trigger, and the conversational agent. 1) The user selects the line starting at 0 minutes 8 seconds. 2) The video cursor jumps directly to the start time of this scene. 3) The user asks the agent whether there is anything special about the slippers in this context. 3a) Already aware of the timestamp the user is focused on, the agent responds with the corresponding details about the slippers.
  • Figure 3: Global modification through the conversational agent. The participant wanted to know the man’s name (a), then asked the conversational agent to update the script by replacing “the man,” “the guy,” and “a man” with the man’s name (b). The agent made the global edit accordingly (c). Because it has a contextual understanding of the video and script, it did more than a string replacement.
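
The context-aware interaction described for Figure 2 (select a script line, seek the video to its start time, and pass that timestamp to the agent) might look roughly like the sketch below. This is an illustration only, not the ADCanvas implementation: the `Cue` class, the `parse_vtt` and `build_agent_context` helpers, the prompt layout, and the sample script are all hypothetical, and the parser handles only plain WebVTT cues with `mm:ss.ttt` or `hh:mm:ss.ttt` timings.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Union

# Made-up sample script for illustration; not taken from the paper.
SAMPLE_VTT = """WEBVTT

00:00.000 --> 00:07.500
A man walks into a dim hallway.

00:08.000 --> 00:12.000
He looks down at a pair of slippers.
"""

# Matches a cue timing line; the hours component is optional.
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


@dataclass
class Cue:
    """Hypothetical in-memory form of one AD cue."""
    start: float  # seconds
    end: float    # seconds
    text: str


def _seconds(hours, minutes, seconds, millis) -> float:
    """Convert timestamp components (as strings) to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(source: str) -> List[Cue]:
    """Parse plain WebVTT cues, skipping blocks without a timing line."""
    cues = []
    for block in re.split(r"\n\s*\n", source.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                parts = match.groups()
                text = "\n".join(lines[i + 1:]).strip()
                cues.append(Cue(_seconds(*parts[:4]), _seconds(*parts[4:]), text))
                break
    return cues


def build_agent_context(cue: Cue, question: str) -> Dict[str, Union[float, str]]:
    """Pair the video seek target with a prompt that carries the focused cue."""
    prompt = (
        f"Focused cue: {cue.start:.3f}s to {cue.end:.3f}s\n"
        f"Current description: {cue.text}\n"
        f"Question: {question}"
    )
    return {"seek_to": cue.start, "prompt": prompt}


cues = parse_vtt(SAMPLE_VTT)
context = build_agent_context(cues[1], "Is there anything special about the slippers?")
```

Here `context["seek_to"]` is the time a player would jump to, and `context["prompt"]` is what a multimodal agent could receive alongside the video so that the question is answered for the selected scene rather than for the whole clip.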