Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

Shivam Ratnakant Mhaskar, Nirmesh J. Shah, Mohammadi Zaki, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv Ratn Shah

TL;DR

This work tackles duration alignment in Automatic Video Dubbing (AVD) by reframing neural machine translation as a phoneme-count-aware task. It introduces a reinforcement learning framework (RL-NMT) that rewards translations whose phoneme counts closely match those of the source sentence, and adds a student-teacher distillation stage (ST-RL-NMT) with a KL loss to balance translation quality against duration. A new Phoneme Count Compliance (PCC) score quantifies length compliance. Experiments on English-Hindi with the BPCC corpus show an improvement of approximately 36% in PCC on some setups, at the cost of some degradation in traditional translation quality metrics; the ST-RL-NMT variant mitigates this degradation. The approach uses a strong teacher (IndicTrans2-FT) and is evaluated with standard MT metrics on the FLoRes, BPCC, and movie-subtitle test sets. Overall, the paper contributes a phoneme-based, RL-driven method and a concrete PCC metric for duration-synchronized AVD, with paths toward broader language coverage and efficiency optimizations.

Abstract

A traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules: Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS). Within AVD pipelines, isometric-NMT algorithms are employed to regulate the length of the translated output text, so that the dubbed audio stays synchronized with the video. Previous approaches have focused on aligning the number of characters and words in the source and target language texts of machine translation models. Our approach instead aims to align the number of phonemes, as they are closely associated with speech duration. In this paper, we present the development of an isometric NMT system using Reinforcement Learning (RL), with a focus on optimizing the alignment of phoneme counts in the source and target language sentence pairs. To evaluate our models, we propose the Phoneme Count Compliance (PCC) score, which is a measure of length compliance. Our approach demonstrates a substantial improvement of approximately 36% in the PCC score compared to state-of-the-art models when applied to the English-Hindi language pair. Moreover, we propose a student-teacher architecture within the framework of our RL approach to maintain a trade-off between the phoneme count and translation quality.
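The abstract does not spell out how the phoneme count ratio or the PCC score is computed, so the following is only a minimal sketch of what a length-compliance measure over phoneme counts could look like. It assumes phoneme counts are already available as integers (for example from a grapheme-to-phoneme tool, which is not shown), and the function names and the `tolerance` value are hypothetical choices for illustration, not the paper's definitions.

```python
from typing import Sequence, Tuple


def phoneme_count_ratio(src_phonemes: int, tgt_phonemes: int) -> float:
    """Ratio of target to source phoneme counts; 1.0 means equal counts."""
    if src_phonemes <= 0:
        raise ValueError("source phoneme count must be positive")
    return tgt_phonemes / src_phonemes


def is_compliant(src_phonemes: int, tgt_phonemes: int, tolerance: float = 0.2) -> bool:
    """True if the ratio lies within 1 +/- tolerance.

    The tolerance is an assumed, illustrative threshold, not a value
    taken from the paper.
    """
    ratio = phoneme_count_ratio(src_phonemes, tgt_phonemes)
    return abs(ratio - 1.0) <= tolerance


def compliance_score(pairs: Sequence[Tuple[int, int]], tolerance: float = 0.2) -> float:
    """Percentage of (source, target) phoneme-count pairs that are compliant."""
    if not pairs:
        raise ValueError("need at least one sentence pair")
    compliant = sum(is_compliant(s, t, tolerance) for s, t in pairs)
    return 100.0 * compliant / len(pairs)


# Hypothetical phoneme counts for four sentence pairs (source, target).
example_pairs = [(20, 22), (20, 30), (10, 9), (15, 15)]
score = compliance_score(example_pairs)
```

In this sketch three of the four pairs fall inside the assumed tolerance, so `score` is 75.0. A reward for RL training could be derived from the same ratio, but the exact reward shape used by the authors is not given in this excerpt.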

Paper Structure (16 sections, 9 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Schema showing (a) a block diagram of the proposed RL-NMT architecture and (b) the modified agent with the student-teacher (ST) framework for quality-duration balance.
  • Figure 2: Example of quality degradation with RL-NMT and the improvement achieved with ST-RL-NMT.
  • Figure 3: Plots of the evaluation metrics at each RL step for the (a) FLoRes, (b) Movie, (c) BPCC General, and (d) BPCC Conversational test sets. The last step uses the student-teacher objective.
  • Figure 4: Trade-off between BLEU score and PCC score.
  • Figure 5: A qualitative example of AVD using the baseline and the proposed approach.