Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

Jeongsoo Choi; Jaehun Kim; Joon Son Chung

TL;DR

Dub-S2ST tackles cross-lingual dubbing with a textless speech-to-speech translation pipeline that preserves the source speech's duration, speaker identity, and speaking speed. It combines a discrete diffusion-based speech-to-unit translator with explicit duration control, a unit-based speed adaptation mechanism that needs no text, and a conditional flow matching unit-to-speech synthesizer conditioned on the translated units and the source speaker's identity. The approach achieves accurate duration matching, natural prosody, and competitive translation quality, validated through quantitative metrics and human MOS evaluations, with ablations confirming the contribution of each component. This makes the framework suited to dubbing multilingual media, where the translated speech must stay time-aligned with the original. The code is publicly available for reproducibility.

Abstract

This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the translated units and source speaker's identity using a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech's duration and speaking pace, while achieving competitive translation performance. The code is available at https://github.com/kaistmm/Dub-S2ST.
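
To make the two control signals in the abstract concrete, the sketch below shows one way duration control and a text-free speaking-rate measure could be computed from discrete units. It is an illustration only, not the authors' implementation: the 50 Hz unit frame rate, the function names, and the definition of speaking rate as deduplicated units per second are all assumptions introduced here.

```python
from itertools import groupby

# Assumption: one discrete unit per 20 ms frame (50 units per second).
FRAME_RATE_HZ = 50


def target_unit_length(source_duration_s, frame_rate_hz=FRAME_RATE_HZ):
    """Number of target units needed to match the source duration."""
    return max(1, round(source_duration_s * frame_rate_hz))


def deduplicate(units):
    """Collapse consecutive repeats, e.g. [5, 5, 9] -> [5, 9]."""
    return [unit for unit, _ in groupby(units)]


def unit_rate(units, frame_rate_hz=FRAME_RATE_HZ):
    """Hypothetical text-free speaking-rate proxy: distinct unit
    segments per second of speech."""
    if not units:
        return 0.0
    return len(deduplicate(units)) * frame_rate_hz / len(units)


if __name__ == "__main__":
    source_units = [5, 5, 5, 9, 9, 2, 5, 5, 5, 5]  # toy unit sequence
    print(target_unit_length(2.0))   # target length for 2 s of source speech
    print(unit_rate(source_units))   # speaking-rate proxy of the source
```

Under these assumptions, a translator with explicit length control would be asked to emit `target_unit_length(...)` units, and the source's `unit_rate` would serve as the pace it is guided toward.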
