Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus

Samy Ouzerrout

TL;DR

LoReSpeech addresses the scarcity of aligned audio data for under-represented languages with a two-stage pipeline. First, LoReASR, a sub-corpus of short audio clips aligned with their transcriptions, is collected via a collaborative vocalization platform. Second, long-form recordings are aligned to verse-level references using the Montreal Forced Aligner (MFA). Alignment quality is checked through two-tier validation: manual review and automatic TER-based checks. The resulting intra-language and inter-language audio alignments support direct speech-to-speech translation, strengthen multilingual ASR, and enable cross-language linguistic analysis and language preservation. Conducted within the Tutlayt AI project, the work offers a scalable, community-centered framework for building inclusive speech resources and outlines plans for data expansion and evaluation across more languages.
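As an illustration of what an automatic edit-rate check on an aligned segment might look like, here is a minimal Python sketch. It is not the authors' implementation: it assumes the check compares a reference transcript with an ASR hypothesis of the aligned audio, it uses a plain word-level edit distance (without the shift operation of full Translation Edit Rate), and the function names and the 0.3 threshold are hypothetical.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_rate(reference, hypothesis):
    """Word-level edits divided by reference length (simplified, no shifts)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Empty reference: perfect only if the hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def needs_manual_review(reference, hypothesis, threshold=0.3):
    """Flag a segment for manual review (threshold is an assumed value)."""
    return edit_rate(reference, hypothesis) > threshold
```

Under these assumptions, segments whose edit rate exceeds the threshold would be routed to the manual review tier, while the rest would pass automatically.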

Abstract

Aligned audio corpora are fundamental to NLP technologies such as ASR and speech translation, yet they remain scarce for underrepresented languages, hindering their technological integration. This paper introduces a methodology for constructing LoReSpeech, a low-resource speech-to-speech translation corpus. Our approach begins with LoReASR, a sub-corpus of short audio clips aligned with their transcriptions, created through a collaborative platform. Building on LoReASR, long-form audio recordings, such as biblical texts, are aligned using tools such as the Montreal Forced Aligner (MFA). LoReSpeech delivers both intra- and inter-language alignments, enabling advancements in multilingual ASR systems, direct speech-to-speech translation models, and linguistic preservation efforts, while fostering digital inclusivity. This work is conducted within the Tutlayt AI project (https://tutlayt.fr).

Paper Structure

This paper contains 13 sections and 1 figure.

Figures (1)

  • Figure 1: Vocalization interface for Chechen