REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation

Nameer Hirschkind, Joseph Liu, Xiao Yu, Mahesh Kumar Nandwana

TL;DR

The paper tackles the challenge of balancing translation quality and latency in Simultaneous Speech Translation by introducing REINA, an information-theoretic loss that trains a READ/WRITE policy to adapt non-streaming S2TT models into streaming systems. REINA uses a mutual-information proxy computed from partial and full audio inputs and optimizes a joint loss that includes a monotonicity constraint and L2 regularization to yield robust, low-latency policies. Trained on large open-source datasets with a three-stage process, REINAStream achieves state-of-the-art streaming performance for models of comparable size and introduces NoSE as a fair, normalized efficiency metric. The work demonstrates strong low-latency BLEU improvements across MUST-C and CVSS-C, and its ablations highlight the importance of truncation training and monotonicity in policy learning, with plans to extend to SimulS2ST.

Abstract

Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training on only open source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.
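
The strategy stated above, waiting for more input only when doing so adds information, can be illustrated with a toy calculation. The sketch below is not the paper's REINA loss. It assumes a per-token information-gain proxy equal to the cross-entropy of the reference token under partial audio minus its cross-entropy under full audio, and it substitutes a hypothetical fixed threshold for the learned policy network. All function names, the threshold value, and the probabilities are illustrative assumptions.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability the model assigns to the reference token."""
    return -math.log(probs[target])


def information_gain(partial_probs, full_probs, target):
    """Assumed proxy: drop in cross-entropy on the reference token when the
    model sees the full audio instead of a truncated prefix."""
    return cross_entropy(partial_probs, target) - cross_entropy(full_probs, target)


def decide(gain, threshold=0.5):
    """Hypothetical fixed-threshold rule standing in for a learned policy."""
    return "READ" if gain > threshold else "WRITE"


# Made-up next-token distributions over a three-word vocabulary.
full = {"chat": 0.90, "chien": 0.05, "maison": 0.05}
early_prefix = {"chat": 0.30, "chien": 0.40, "maison": 0.30}
late_prefix = {"chat": 0.85, "chien": 0.10, "maison": 0.05}

gain_early = information_gain(early_prefix, full, "chat")
gain_late = information_gain(late_prefix, full, "chat")

# With little audio, waiting helps a lot; with most of it, waiting adds little.
print(decide(gain_early), decide(gain_late))  # prints "READ WRITE"
```

In this toy setting the early prefix leaves the model uncertain, so the gain from waiting is large and the rule reads more audio; the late prefix already matches the full-audio prediction closely, so the rule emits the token.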

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Non-streaming and streaming training procedures for REINAStream. For non-streaming training, we use a trainable MT encoder to train on parallel NMT data. During streaming training, we a) pass both the full audio and a truncated version of it through the model, b) compute the cross-entropy (CE) loss of each, c) predict a policy using the policy network on top of the partial-audio output of the decoder, and finally d) calculate the REINA loss using the CE terms and policy predictions.
  • Figure 2: Average Lagging (AL) vs. BLEU score on MUST-C. Horizontal lines represent non-streaming performance.
  • Figure 3: AL/BLEU curve on Es$\rightarrow$En split of the CVSS-C dataset. We report ASR-BLEU only for StreamSpeech.
  • Figure A.1: We define NoSE as the area of the orange shaded region divided by the area of the blue rectangle.
  • Figure B.1: Average Lagging (AL) vs. BLEU score on CVSS-C. Dotted lines represent non-streaming BLEU scores. Note that StreamSpeech only reports ASR-BLEU in their paper, so we report StreamSpeech's ASR-BLEU rather than BLEU.
  • ...and 3 more figures
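
The caption of Figure A.1 defines NoSE as a ratio of two areas on the latency/quality plot. The text here does not give the exact boundaries of those regions, so the following is only a sketch under stated assumptions: the shaded region is taken to be the area under the measured AL/BLEU curve (trapezoidal rule), and the rectangle is taken to span the same latency range with height equal to the non-streaming BLEU. The function names and the sample numbers are hypothetical, not results from the paper.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve through (xs[i], ys[i])."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


def normalized_efficiency(latencies, bleus, offline_bleu):
    """Assumed NoSE-style ratio: area under the AL/BLEU curve divided by the
    rectangle (latency range) x (non-streaming BLEU)."""
    if len(latencies) != len(bleus) or len(latencies) < 2:
        raise ValueError("need at least two (latency, BLEU) points")
    points = sorted(zip(latencies, bleus))  # order by latency
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width = xs[-1] - xs[0]
    if width <= 0 or offline_bleu <= 0:
        raise ValueError("latency range and offline BLEU must be positive")
    return trapezoid_area(xs, ys) / (width * offline_bleu)


# Made-up operating points: Average Lagging in ms and the BLEU at each.
score = normalized_efficiency([1000, 2000, 3000], [20.0, 24.0, 26.0], 28.0)
print(round(score, 4))
```

Under these assumptions, a streaming system that matched the non-streaming BLEU at every latency would score 1.0, and lower scores indicate more quality lost to streaming.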