Making Short-Form Videos Accessible with Hierarchical Video Summaries
Tess Van Daele, Akhil Iyer, Yuning Zhang, Jalyn C. Derry, Mina Huh, Amy Pavel
TL;DR
This paper addresses the accessibility gap that blind and low vision (BLV) viewers face with short-form videos by introducing ShortScribe, a system that generates hierarchical video descriptions through a multi-modal pipeline (ASR, OCR, BLIP-2, CLIP, GPT-4). It provides three levels of description (short, long, and shot-by-shot) plus on-screen text, so BLV users can skim content quickly and access more detail as needed. A formative study informs the design and confirms a need for on-demand, non-time-aligned descriptions suited to fast-scrolling feeds. A within-subject user study with 10 BLV participants shows significant gains in self-reported comprehension and in the accuracy of participants' video summaries when using ShortScribe versus a baseline. The work demonstrates the feasibility and utility of hierarchical, AI-generated descriptions for short-form video accessibility, and it discusses platform-level recommendations and future work on extending the approach to other video formats and domains.
Abstract
Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts (i.e., short-form videos) have become a primary source of information and entertainment. Many short-form videos are inaccessible to blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays. In our formative study, 7 BLV viewers who regularly watched short-form videos reported frequently skipping such inaccessible content. We present ShortScribe, a system that provides hierarchical visual summaries of short-form videos at three levels of detail to support BLV viewers in selecting and understanding short-form videos. ShortScribe allows BLV users to navigate between video descriptions based on their level of interest. To evaluate ShortScribe, we assessed description accuracy and conducted a user study with 10 BLV participants comparing ShortScribe to a baseline interface. When using ShortScribe, participants reported higher comprehension and provided more accurate summaries of video content.
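
To make the idea of navigating between three levels of detail concrete, the following is a minimal illustrative sketch of how such a hierarchy could be represented. It is not the authors' implementation: the class names (`ShotDescription`, `HierarchicalSummary`), the `describe` method, the numeric level scheme, and the sample strings are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ShotDescription:
    """One shot of the video (hypothetical structure)."""
    index: int
    visual: str
    on_screen_text: str = ""


@dataclass
class HierarchicalSummary:
    """Three levels of detail: short, long, and shot-by-shot (hypothetical)."""
    short: str
    long: str
    shots: List[ShotDescription] = field(default_factory=list)

    def describe(self, level: int) -> str:
        """Return the description for level 1 (short), 2 (long), or 3 (shots)."""
        if level == 1:
            return self.short
        if level == 2:
            return self.long
        if level == 3:
            lines = []
            for shot in self.shots:
                line = f"Shot {shot.index}: {shot.visual}"
                if shot.on_screen_text:
                    line += f' (on-screen text: "{shot.on_screen_text}")'
                lines.append(line)
            return "\n".join(lines)
        raise ValueError("level must be 1, 2, or 3")


# Placeholder content, invented purely to exercise the structure.
summary = HierarchicalSummary(
    short="A person bakes bread.",
    long="A person mixes dough, lets it rise, and bakes a loaf in an oven.",
    shots=[
        ShotDescription(1, "Hands mix flour and water in a bowl", "Step 1"),
        ShotDescription(2, "A loaf comes out of the oven"),
    ],
)
```

In this sketch a viewer would request `summary.describe(1)` while skimming a feed and move to levels 2 or 3 only for videos of interest, mirroring the on-demand navigation the abstract describes.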
