SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers

Zheng Ning; Brianna L. Wimer; Kaiwen Jiang; Keyi Chen; Jerrick Ban; Yapeng Tian; Yuhang Zhao; Toby Jia-Jun Li

TL;DR

This work tackles the limitations of static audio descriptions (ADs) for blind or low-vision (BLV) viewers by introducing SPICA, an AI-powered system that augments ADs with interactive, layer-based descriptions, spatialized sounds, and high-contrast visual cues. SPICA combines a multi-module ML pipeline (scene analysis, object detection, object-description generation, and depth-aware sound retrieval) with a user-focused frontend that supports temporal navigation of keyframes and spatial exploration of objects within frames. A within-subjects user study with 14 BLV participants shows that SPICA improves understanding and immersion compared to conventional ADs, and technical benchmarks indicate high precision in object labeling and higher-quality object-level descriptions. The findings offer practical design guidance for multisensory, interaction-based accessibility tools and point toward adaptive, personalized, and longer-form video experiences for BLV users, with potential extensions to group viewing and VQA-enabled content.

Abstract

Blind or Low-Vision (BLV) users often rely on audio descriptions (AD) to access video content. However, conventional static ADs can leave out detailed information in videos, impose a high mental load, neglect the diverse needs and preferences of BLV users, and lack immersion. To tackle these challenges, we introduce SPICA, an AI-powered system that enables BLV users to interactively explore video content. Informed by prior empirical studies on BLV video consumption, SPICA offers novel interactive mechanisms for supporting temporal navigation of frame captions and spatial exploration of objects within key frames. Leveraging an audio-visual machine learning pipeline, SPICA augments existing ADs by adding interactivity, spatial sound effects, and individual object descriptions without requiring additional human annotation. Through a user study with 14 BLV participants, we evaluated the usability and usefulness of SPICA and explored user behaviors, preferences, and mental models when interacting with augmented ADs.

Paper Structure (49 sections, 4 figures, 6 tables)

This paper contains 49 sections, 4 figures, 6 tables.

Introduction
Related Work
Audio Descriptions Generation
Interactive Exploration for Visual Content
BLV User Engagement with Visual Content
SPICA System
Example Workflow
System Overview
Temporal Navigation on Visual Scenes
Spatial Exploration of Objects within Frames
Keyframe Detection and Description Generation Pipeline
Implementation
Technical Evaluation
Dataset
Procedure
...and 34 more sections

Figures (4)

Figure 1: The main interface of SPICA, which lets BLV users explore the video interactively. A) The video player. Users can explore objects in the frame using their fingers or the keyboard arrow keys. A selected object is highlighted with a high-contrast color mask. B) Frame-Level Caption List. Users can use the arrow keys to navigate between visual scenes. C) Object-Level Caption List. Users can use the arrow keys to scan through objects. Once an object is selected, a spatial sound effect associated with the object is played based on its estimated 3D position.
Figure 2: The machine learning pipeline used in SPICA.
Figure 3: Participants' ratings on the usability and usefulness questions about SPICA.
Figure 4: User behaviors on a sample video (V6). A total of 6 participants (P1, P4, P5, P8, P10, and P11) experienced this video using SPICA.
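
The Figure 1 caption says that a selected object's sound effect is played according to its estimated 3D position. The sketch below illustrates one way such a mapping could work; it is not SPICA's implementation. The function name `spatial_cue`, the normalized inputs, the constant-power panning law, and the 0.7 attenuation factor are all assumptions made for this example.

```python
import math


def spatial_cue(x_norm: float, depth_norm: float) -> tuple[float, float]:
    """Map an object's position to (left_gain, right_gain).

    Hypothetical helper, not taken from the paper.
    x_norm:     horizontal center of the object's bounding box,
                0.0 (left edge of frame) to 1.0 (right edge).
    depth_norm: estimated depth, 0.0 (nearest) to 1.0 (farthest).
    """
    # Clamp inputs so slightly out-of-range detections stay valid.
    x = min(max(x_norm, 0.0), 1.0)
    d = min(max(depth_norm, 0.0), 1.0)

    # Constant-power panning: left^2 + right^2 stays fixed across the
    # stereo field, so perceived loudness does not dip at the center.
    angle = x * math.pi / 2

    # Assumed attenuation: farther objects are quieter but never silent.
    loudness = 1.0 - 0.7 * d

    return (math.cos(angle) * loudness, math.sin(angle) * loudness)


if __name__ == "__main__":
    # An object left of center and fairly close to the camera.
    left, right = spatial_cue(0.25, 0.2)
    print(f"left={left:.3f} right={right:.3f}")
```

In a browser frontend the same two inputs could instead drive a dedicated spatial-audio node. The point here is only that horizontal position controls the left/right balance and estimated depth controls the volume.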
