EvolveCaptions: Empowering DHH Users Through Real-Time Collaborative Captioning

Liang-Yuan Wu, Dhruv Jain

TL;DR

EvolveCaptions addresses the inequities of ASR for Deaf and Hard of Hearing users by introducing a real-time, collaborative adaptation workflow. The method combines live caption correction, phonetically guided clause prompts, and lightweight speaker-specific fine-tuning to adapt Whisper-based ASR to individual voices during conversation. A lab study with 12 DHH and 6 hearing participants showed significant reductions in transcription errors and positive user experiences, supporting the viability of in-situ, collaborative personalization. The work advances equitable communication by reframing accessibility as a collective, learnable process and provides open-source artifacts for broader adoption.

Abstract

Automatic Speech Recognition (ASR) systems often fail to accurately transcribe speech from Deaf and Hard of Hearing (DHH) individuals, especially during real-time conversations. Existing personalization approaches typically require extensive pre-recorded data and place the burden of adaptation on the DHH speaker. We present EvolveCaptions, a real-time, collaborative ASR adaptation system that supports in-situ personalization with minimal effort. Hearing participants correct ASR errors during live conversations. Based on these corrections, the system generates short, phonetically targeted prompts for the DHH speaker to record, which are then used to fine-tune the ASR model. In a study with 12 DHH and six hearing participants, EvolveCaptions reduced Word Error Rate (WER) across all DHH users within one hour of use, using only five minutes of recording time on average. Participants described the system as intuitive, low-effort, and well-integrated into communication. These findings demonstrate the promise of collaborative, real-time ASR adaptation for more equitable communication.
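
Since the abstract reports results as Word Error Rate, a minimal sketch of the standard WER definition may help: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration; the text normalization actually used in the study is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Assumption: text is lowercased and split on whitespace; the paper's
    own normalization may differ.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference; a lower value after each adaptation round indicates that the captions are closer to what the speaker said.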

Paper Structure

This paper contains 43 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: EvolveCaptions user interface. (1) Clicking “Start ASR” begins generating real-time captions, while “Start Recording” displays a list of targeted clauses for the DHH speaker to record; (2) real-time captions reflect the DHH speaker’s speech; (3) hearing users can refine captions by correcting errors (yellow highlights) or flagging uncertain words (red highlights); (4) during recordings, the interface shows targeted samples along with live waveforms for guidance.
  • Figure 2: Word Error Rate (WER) improvement across four iterations using EvolveCaptions in our user study.
  • Figure 3: Participant recording durations across four sessions, with segment colors indicating session-specific durations. Totals are shown above each bar.
  • Figure 4: Comparison of captioning technologies for DHH users. The horizontal axis contrasts unilateral versus collaborative interaction, while the vertical axis contrasts static versus adaptive ASR models. EvolveCaptions is positioned in the top-right quadrant, representing a collaborative and adaptive approach.