CrossA11y: Identifying Video Accessibility Issues via Cross-modal Grounding

Xingyu "Bruce" Liu; Ruolin Wang; Dingzeyu Li; Xiang 'Anthony' Chen; Amy Pavel

TL;DR

CrossA11y tackles the laborious task of making videos accessible by automatically detecting visual-audio modality asymmetries through cross-modal grounding. The system segments audio and visuals, matches segments using MIL-NCE and MMV embeddings, and surfaces inaccessible moments in a unified authoring interface for AD and CC, followed by preview and export. A lab study with 11 participants and a pair of content creators shows that CrossA11y improves recall and precision for accessibility issues, reduces mental workload, and enables more efficient AD/CC authoring. The work demonstrates practical impact by integrating cross-modal analysis into a video production workflow and highlights future opportunities to enhance segmentation, prioritization, and broader applicability across media types.

Abstract

Authors make their videos visually accessible by adding audio descriptions (AD), and auditorily accessible by adding closed captions (CC). However, creating AD and CC is challenging and tedious, especially for non-professional describers and captioners, because accessibility problems in videos are difficult to identify. A video author has to watch the entire video and manually check for inaccessible information frame by frame, in both the visual and auditory modalities. In this paper, we present CrossA11y, a system that helps authors efficiently detect and address visual and auditory accessibility issues in videos. Using cross-modal grounding analysis, CrossA11y automatically measures the accessibility of visual and audio segments in a video by checking for modality asymmetries. CrossA11y then displays these segments and surfaces visual and audio accessibility issues in a unified interface, making it intuitive to locate and review issues, script AD/CC in place, and preview the described and captioned video immediately. We demonstrate the effectiveness of CrossA11y through a lab study with 11 participants, comparing it to an existing baseline.
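The idea of checking for modality asymmetries can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 4-dimensional vectors stand in for learned segment embeddings (the summary above names MIL-NCE and MMV), and the function names and the 0.5 threshold are hypothetical choices made only for this example. A visual segment with no sufficiently similar audio segment is flagged as a candidate for audio description, and an audio segment with no sufficiently similar visual segment is flagged as a candidate for captioning.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(visual, audio):
    """Rows index visual segments, columns index audio segments."""
    return [[cosine(v, a) for a in audio] for v in visual]


def find_asymmetries(visual, audio, threshold=0.5):
    """Return indices of segments with no match in the other modality.

    `threshold` is a hypothetical cut-off chosen for this toy example.
    """
    sim = similarity_matrix(visual, audio)
    # Visual segments whose best audio match is weak: candidates for AD.
    unmatched_visual = [
        i for i, row in enumerate(sim) if max(row, default=0.0) < threshold
    ]
    # Audio segments whose best visual match is weak: candidates for CC.
    unmatched_audio = [
        j
        for j in range(len(audio))
        if max((row[j] for row in sim), default=0.0) < threshold
    ]
    return unmatched_visual, unmatched_audio


# Toy embeddings (assumed values, not real model output).
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
audio = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.2, 0.98, 0.0], [0.0, 0.0, 0.0, 1.0]]

needs_description, needs_caption = find_asymmetries(visual, audio)
print(needs_description, needs_caption)
```

In this toy input, the second visual segment and the third audio segment have no close counterpart in the other modality, so they are the ones flagged. A real pipeline would also need the segmentation and post-processing stages, which this sketch omits.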

Paper Structure (48 sections, 4 equations, 8 figures, 5 tables)

Introduction
Related Work
Video Accessibility Guidelines
Authoring AD and CC
Accessibility Assessment Tools
Assessing Audio and Visual Similarity
CrossA11y Interface
Video Pane
Video Description Pane
Captions Pane
Accessible Video Preview
Cross-modal Grounding Pipeline
Segmentation
Cross-modal Grounding
Visual-text/audio Matching
...and 33 more sections

Figures (8)

Figure 1: In CrossA11y's interface, the video pane (A) displays audio and visual timelines with an accessibility visualization that allows authors to quickly identify and navigate to accessibility issues. The video description pane (F) surfaces inaccessible visual segments and lets authors add descriptions. The captions pane (E) provides time-aligned captions and detected non-speech sound segments so that authors can seek within the video and add captions.
Figure 2: Authors can hover on segments of visual/audio timelines in CrossA11y to inspect the computed correspondence between the selected segment and segments in the other modality. Here, opaque segments in the audio track are the segments predicted by the system to match the segment in the video track.
Figure 3: (A) After editing their description, authors can add the description to the video by clicking "Save". The vertical side bar and the horizontal bar of the corresponding segment will turn blue, indicating that this problem has been addressed. (B) Authors can also dismiss a problem by clicking "Dismiss". The bars will turn gray, indicating that this problem has been dismissed.
Figure 4: Our computational pipeline includes: (A) Segmentation, which splits the video into audio and visual segments, (B) Cross-modal Grounding, which finds correspondences between visual and audio segments, and (C) Post-processing, which filters the identified correspondences.
Figure 5: Example of a visual-to-audio similarity matrix and a visual-to-text (transcript) similarity matrix. Red dotted lines highlight examples of asymmetric and potentially inaccessible segments.
...and 3 more figures
