HI-TransPA: Hearing Impairments Translation Personal Assistant

Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng

TL;DR

This work tackles barriers in bidirectional communication for hearing-impaired individuals by presenting HI-TransPA, an instruction-driven audio-visual Omni-Model that fuses high-frame-rate lip dynamics with audio to enable both translation and dialogue. It introduces a comprehensive data preprocessing pipeline and a quality-aware curriculum to learn from imperfect real-world speech, along with a redesigned vision component (SigLIP) and a Unified 3D-Resampler to efficiently encode lip motion. The approach achieves state-of-the-art results on the HI-Dialogue dataset, improving both literal transcription accuracy (CER) and semantic coherence (EmbSim) through curriculum-based training. By delivering an end-to-end framework tailored for accessibility, HI-TransPA lays a foundation for future Omni-Model–driven assistive communication technologies with practical impact on daily interactions for the hearing impaired.

Abstract

Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, we develop a multimodal preprocessing and curation pipeline that detects facial landmarks, stabilizes the lip region, and quantitatively evaluates sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. Architecturally, we employ a novel unified 3D-Resampler to efficiently encode the lip dynamics, which is critical for accurate interpretation. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. Our work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
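
The quality-guided curriculum described above can be illustrated with a minimal sketch. The abstract does not give the scoring function, the acceptance threshold, or the number of training stages, so the `Sample.quality` field, the `threshold` argument, and `n_hard_stages` below are all hypothetical stand-ins. The sketch only shows the general idea: train first on high-scoring samples, then progressively mix in lower-scoring ones.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    sample_id: str
    quality: float  # hypothetical combined audio-visual quality score in [0, 1]


def split_by_quality(samples: List[Sample], threshold: float) -> Tuple[List[Sample], List[Sample]]:
    """Partition samples into accepted (clean) and rejected (harder) subsets."""
    accepted = [s for s in samples if s.quality >= threshold]
    rejected = [s for s in samples if s.quality < threshold]
    return accepted, rejected


def curriculum_stages(samples: List[Sample], threshold: float, n_hard_stages: int = 2) -> List[List[Sample]]:
    """Build training stages: stage 0 is the accepted set only; each later
    stage adds another slice of rejected samples, easiest of them first."""
    if n_hard_stages < 1:
        raise ValueError("n_hard_stages must be at least 1")
    accepted, rejected = split_by_quality(samples, threshold)
    # Assumption: within the rejected set, a higher score means an easier sample.
    rejected = sorted(rejected, key=lambda s: s.quality, reverse=True)
    stages = [list(accepted)]
    for k in range(1, n_hard_stages + 1):
        cutoff = (len(rejected) * k) // n_hard_stages
        stages.append(accepted + rejected[:cutoff])
    return stages


if __name__ == "__main__":
    data = [Sample("a", 0.9), Sample("b", 0.8), Sample("c", 0.4), Sample("d", 0.2)]
    for i, stage in enumerate(curriculum_stages(data, threshold=0.5)):
        print(i, [s.sample_id for s in stage])
```

Each stage here is a superset of the previous one, so the clean samples stay in the training mix while harder cases are introduced. Whether the paper keeps, reweights, or replaces the clean set in later stages is not stated in this section.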

Paper Structure

This paper contains 33 sections, 14 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: HI-TransPA functions as (a) a translator and (b) an intelligent assistant for individuals with hearing impairments.
  • Figure 2: HI-TransPA System Architecture.
  • Figure 3: Overview of the lip region extraction pipeline. The left column shows input video frames at different timestamps. In the middle, a face landmark detector identifies 468 facial keypoints (purple). The right column isolates and tracks lip landmarks across frames to produce stabilized lip crops used by the vision encoder.
  • Figure 4: Overview of the rejection sampling pipeline. Each audio-visual sample is scored based on quality metrics from both modalities and then divided into accepted ($\mathcal{D}_{\text{accept}}$) and rejected ($\mathcal{D}_{\text{reject}}$) subsets for curriculum learning.
  • Figure 5: Interpretation of the Comprehensive Score (CS). Models are plotted in the $(1-\text{CER})$ vs. EmbSim space. Higher positions and rightward movement correspond to lower error rates and stronger semantic consistency, jointly yielding higher CS values.
  • ...and 1 more figure
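
Figure 5 plots models along a $(1-\text{CER})$ axis. As a rough illustration of that quantity, the sketch below computes character error rate in its common form: character-level Levenshtein distance divided by the reference length. This is an assumption about the metric's usual definition. The paper's exact text normalization, and its definitions of EmbSim and CS, are not given in this section and are not reproduced here. The function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    score = cer("kitten", "sitting")
    print(score, 1 - score)
```

Note that CER defined this way can exceed 1 when the hypothesis is much longer than the reference, in which case $1-\text{CER}$ becomes negative.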