Making Short-Form Videos Accessible with Hierarchical Video Summaries

Tess Van Daele, Akhil Iyer, Yuning Zhang, Jalyn C. Derry, Mina Huh, Amy Pavel

TL;DR

This paper tackles the BLV accessibility gap in short-form videos by introducing ShortScribe, a system that generates hierarchical video descriptions through a multi-modal pipeline (ASR, OCR, BLIP-2, CLIP, GPT-4). It provides three levels of descriptions (short, long, and shot-by-shot) plus on-screen text, enabling BLV users to quickly skim content and progressively access details. A formative study informs design and confirms a need for on-demand, non-time-aligned descriptions tailored to fast-scrolling feeds, while a within-subject user study with 10 BLV participants shows significant gains in video comprehension and accuracy when using ShortScribe versus a baseline. The work demonstrates the feasibility and utility of hierarchical, AI-generated descriptions for short-form video accessibility and discusses platform-level recommendations and future work to extend to other video formats and domains.

Abstract

Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts (i.e. short-form videos) have become a primary source of information and entertainment. Many short-form videos are inaccessible to blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays. In our formative study, 7 BLV viewers who regularly watched short-form videos reported frequently skipping such inaccessible content. We present ShortScribe, a system that provides hierarchical visual summaries of short-form videos at three levels of detail to support BLV viewers in selecting and understanding short-form videos. ShortScribe allows BLV users to navigate between video descriptions based on their level of interest. To evaluate ShortScribe, we assessed description accuracy and conducted a user study with 10 BLV participants comparing ShortScribe to a baseline interface. When using ShortScribe, participants reported higher comprehension and provided more accurate summaries of video content.

Paper Structure (44 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Participant ratings of video accessibility for pre-selected videos.
  • Figure 2: The ShortScribe interface consists of (a) front screen video information including the short description, username, caption, and audio title, (b) video controls, (c) a button to open the description pane which includes the long description, on-screen text, and shot-by-shot descriptions, and (d) video statistics. Video Credit: TikTok used with permission from @nourished.by.mads nourishedbymads.
  • Figure 3: ShortScribe takes a video as input, transcribes the audio using automatic speech recognition (ASR), segments the video into shots, and selects the middle frame of each shot as a keyframe. It then processes the transcript, generated image captions (BLIP-2), and on-screen text (OCR) to produce video data for each keyframe. We use a large language model (GPT-4) to summarize this data into a short, long, and shot-by-shot description.
  • Figure 4: We analyzed hallucinations in descriptions for 58 videos (long, short, and 50-word descriptions) and for a subsample of 18 videos (per-shot descriptions). Descriptions for each video contained 0 to 7 hallucinations. Short descriptions had the lowest percentage of videos with hallucinations, while shot-by-shot descriptions had the highest.
  • Figure 5: An analysis of the errors in one of the 2 (of 58) videos that had more than three errors in the short description. The video depicts a lighthearted singalong. BLIP-2 mistakenly recognizes a toddler concentrating on singing as angry, and the on-screen text shows a quiz with the lyrics to a sad song (All Too Well by Taylor Swift). The long description, and in turn the short description, incorrectly infer that the video is sad.
  • ...and 2 more figures
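
The Figure 3 caption says that ShortScribe segments a video into shots and selects the middle frame of each shot as a keyframe. The sketch below illustrates only that selection step. It is not the authors' implementation: the `Shot` type, the function names, and the convention that shot boundaries are inclusive frame indices are all assumptions made for illustration. Shot detection, ASR, BLIP-2 captioning, OCR, and the GPT-4 summarization are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    """Hypothetical shot record; both frame indices are assumed inclusive."""
    start_frame: int
    end_frame: int


def middle_keyframe(shot: Shot) -> int:
    """Return the index of the frame halfway through the shot."""
    if shot.end_frame < shot.start_frame:
        raise ValueError("shot must end at or after its start frame")
    # Integer midpoint; for an even-length shot this picks the earlier
    # of the two central frames.
    return (shot.start_frame + shot.end_frame) // 2


def keyframes_for(shots: List[Shot]) -> List[int]:
    """Pick one keyframe index per shot, in shot order."""
    return [middle_keyframe(shot) for shot in shots]


if __name__ == "__main__":
    # Made-up shot boundaries, not taken from any video in the paper.
    example_shots = [Shot(0, 89), Shot(90, 149), Shot(150, 150)]
    print(keyframes_for(example_shots))  # [44, 119, 150]
```

Using one frame per shot keeps the number of frames sent to captioning and OCR proportional to the number of shots rather than to the video length, which matters for videos with rapid visual changes.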