USpeech: Ultrasound-Enhanced Speech with Minimal Human Effort via Cross-Modal Synthesis
Luca Jiang-Tao Yu, Running Zhao, Sijie Ji, Edith C. H. Ngai, Chenshu Wu
TL;DR
USpeech tackles data scarcity in ultrasound-based speech enhancement with a cross-modal ultrasound synthesis framework that uses audio as a bridge between video and ultrasound. Synthesis proceeds in two stages: contrastive video-audio pre-training, then an audio-ultrasound encoder-decoder. A UNet-Transformer-based speech enhancement network and a neural vocoder then recover the clean waveform. USpeech outperforms state-of-the-art ultrasound-based baselines, and results with synthetic ultrasound data match or closely approach those obtained with physically collected data. It also generalizes to large-scale synthetic datasets and real-world scenarios. The approach reduces human data collection effort while enabling scalable, robust ultrasound-enhanced speech applications in noisy environments.
Abstract
Speech enhancement is crucial for ubiquitous human-computer interaction. Recently, ultrasound-based acoustic sensing has emerged as an attractive choice for speech enhancement because of its superior ubiquity and performance. However, due to inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition, existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human effort. At its core is a two-stage framework that establishes the correspondence between visual and ultrasonic modalities by leveraging audio as a bridge. This approach overcomes two challenges: the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project the two modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show that USpeech, using synthetic ultrasound data, achieves performance comparable to that obtained with physical data and outperforms state-of-the-art ultrasound-based speech enhancement baselines. USpeech is open-sourced at https://github.com/aiot-lab/USpeech/.
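
The abstract does not state which contrastive objective is used for video-audio pre-training. As a minimal sketch of how paired embeddings can be pulled into a shared space, the example below assumes a CLIP-style symmetric InfoNCE loss. The function names, the temperature of 0.07, and the embedding shapes are illustrative assumptions, not details taken from USpeech.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of video_emb and row i of audio_emb are assumed to come from the
    same clip (a positive pair); every other row in the batch is a negative.
    The temperature value is an assumption, not a USpeech hyperparameter.
    """
    v = l2_normalize(np.asarray(video_emb, dtype=np.float64))
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    logits = v @ a.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(v))                  # positives sit on the diagonal
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> audio
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))
    audio = video + 0.01 * rng.normal(size=(8, 16))   # nearly aligned pairs
    print("aligned :", symmetric_info_nce(video, audio))
    print("shuffled:", symmetric_info_nce(video, np.roll(audio, 1, axis=0)))
```

Minimizing a loss of this kind rewards embeddings in which each video clip is closest to its own audio track, which is one common way to obtain a shared semantic space; in practice the embeddings would come from trainable video and audio encoders rather than random arrays.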
