Making Short-Form Videos Accessible with Hierarchical Video Summaries

Tess Van Daele; Akhil Iyer; Yuning Zhang; Jalyn C. Derry; Mina Huh; Amy Pavel

TL;DR

This paper addresses the accessibility gap that blind and low vision (BLV) viewers face with short-form videos by introducing ShortScribe, a system that generates hierarchical video descriptions through a multi-modal pipeline (ASR, OCR, BLIP-2, CLIP, GPT-4). It provides three levels of description (short, long, and shot-by-shot) plus on-screen text, so BLV users can skim a video quickly and then request more detail as needed. A formative study informs the design and confirms a need for on-demand, non-time-aligned descriptions suited to fast-scrolling feeds. A within-subjects user study with 10 BLV participants shows significant gains in self-reported comprehension and in the accuracy of participants' video summaries when using ShortScribe versus a baseline interface. The work demonstrates the feasibility and utility of hierarchical, AI-generated descriptions for short-form video accessibility, and it discusses platform-level recommendations and future work on extending the approach to other video formats and domains.

Abstract

Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts (i.e. short-form videos) have become a primary source of information and entertainment. Many short-form videos are inaccessible to blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays. In our formative study, 7 BLV viewers who regularly watched short-form videos reported frequently skipping such inaccessible content. We present ShortScribe, a system that provides hierarchical visual summaries of short-form videos at three levels of detail to support BLV viewers in selecting and understanding short-form videos. ShortScribe allows BLV users to navigate between video descriptions based on their level of interest. To evaluate ShortScribe, we assessed description accuracy and conducted a user study with 10 BLV participants comparing ShortScribe to a baseline interface. When using ShortScribe, participants reported higher comprehension and provided more accurate summaries of video content.

Paper Structure (44 sections, 7 figures, 5 tables)

Introduction
Background & Related Work
Video Accessibility
Audio Description
Text Descriptions and Summaries
Beyond Descriptions
Hierarchical Summaries and Descriptions
Social Media Accessibility
Formative Study
Method
Findings
Current Practice
Short Form Video Accessibility
Platform Accessibility
Participant Suggested Accessibility Improvements
...and 29 more sections

Figures (7)

Figure 1: Participant ratings of video accessibility for pre-selected videos.
Figure 2: The ShortScribe interface consists of (a) front screen video information including the short description, username, caption, and audio title, (b) video controls, (c) a button to open the description pane which includes the long description, on-screen text, and shot-by-shot descriptions, and (d) video statistics. Video Credit: TikTok used with permission from @nourished.by.mads nourishedbymads.
Figure 3: ShortScribe takes a video as input, transcribes the audio using automatic speech recognition (ASR), segments the video into shots, and selects the middle frame of each shot as a keyframe. It then processes the transcript, generated image captions (BLIP-2), and on-screen text (OCR) to produce video data for each keyframe. We use a large language model (GPT-4) to summarize this data into a short, long, and shot-by-shot description.
Figure 4: We analyzed hallucinations in descriptions for 58 videos (long, short, 50-word descriptions) and for a subsample of 18 videos (per shot descriptions). Descriptions for each video contained 0-7 hallucinations. Short descriptions had the lowest percentage of videos with hallucinations, while shot-by-shot descriptions had the highest percentage of videos with hallucinations.
Figure 5: An analysis of the errors in one of the 2 videos (of 58) that had more than three errors in the short description. The video depicts a lighthearted singalong. BLIP-2 mistakenly recognizes a toddler concentrating on singing as angry, and the on-screen text shows a quiz with the lyrics to a sad song (All Too Well by Taylor Swift). The long description, and in turn the short description, incorrectly infers that the video is sad.
...and 2 more figures
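
The keyframe selection step described in the Figure 3 caption (take the middle frame of each shot) can be sketched as follows. This is only an illustration under stated assumptions: the `Shot` type, the inclusive frame-index convention, and the function names are hypothetical and do not come from the paper, and shot boundaries are assumed to have been produced by a separate segmentation step.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    # Hypothetical representation: inclusive frame indices of one shot.
    start_frame: int
    end_frame: int


def middle_keyframe(shot: Shot) -> int:
    """Return the index of the frame midway through the shot."""
    if shot.end_frame < shot.start_frame:
        raise ValueError("shot ends before it starts")
    # Integer midpoint; rounds down for shots with an even frame count.
    return (shot.start_frame + shot.end_frame) // 2


def select_keyframes(shots: list[Shot]) -> list[int]:
    """Pick one keyframe per shot, in shot order."""
    return [middle_keyframe(shot) for shot in shots]


# Example boundaries (made up for illustration only).
shots = [Shot(0, 89), Shot(90, 149), Shot(150, 150)]
keyframes = select_keyframes(shots)
print(keyframes)  # [44, 119, 150]
```

Each selected keyframe would then be passed to the image captioning and OCR steps named in the caption; those steps are not shown here.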
