
A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation

Xiaoqian Liu, Yangfan Du, Jianjin Wang, Yuan Ge, Chen Xu, Tong Xiao, Guocheng Chen, Jingbo Zhu

TL;DR

This work addresses gradient conflicts in multi-task learning for simultaneous speech translation (SimulST) by introducing Modular Gradient Conflict Mitigation (MGCM). MGCM splits the model into modules (LN, FFN, Attention) and detects conflicts per module using the cosine similarity between the primary and auxiliary task gradients. When a conflict exists, it projects the auxiliary gradient onto the plane orthogonal to the primary gradient, which avoids the memory-heavy practice of concatenating all gradients. The method improves translation quality, notably under medium and high latency, with offline BLEU gains of around +0.68 (Greedy) and +0.63 (Beam5) on a DiSeg baseline, and reduces GPU memory consumption by more than 95% relative to other conflict-resolution approaches. Experiments on MuST-C En-De show that MGCM outperforms model-level approaches such as PCGrad and simple discard strategies, with statistically significant improvements (p < 0.05) and memory requirements that scale to larger models. Overall, MGCM offers a practical way to mitigate gradient conflicts at the module level in SimulST while keeping GPU memory consumption low.

Abstract

Simultaneous Speech Translation (SimulST) involves generating target language text while continuously processing streaming speech input, presenting significant real-time challenges. Multi-task learning is often employed to enhance SimulST performance but introduces optimization conflicts between primary and auxiliary tasks, potentially compromising overall efficiency. Existing model-level conflict resolution methods are not well suited to this task, which exacerbates inefficiencies and leads to high GPU memory consumption. To address these challenges, we propose a Modular Gradient Conflict Mitigation (MGCM) strategy that detects conflicts at a finer-grained modular level and resolves them using gradient projection. Experimental results demonstrate that MGCM significantly improves SimulST performance, particularly under medium and high latency conditions, achieving a 0.68 BLEU score gain in offline tasks. Additionally, MGCM reduces GPU memory consumption by over 95% compared to other conflict mitigation methods, establishing it as a robust solution for SimulST tasks.
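
The detect-then-project step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`project_if_conflicting`, `mgcm_step`), the dictionary layout keyed by module name, and the choice to sum the primary and (possibly projected) auxiliary gradients are all hypothetical. A conflict is taken to mean a negative dot product, which is equivalent to negative cosine similarity.

```python
import numpy as np

def project_if_conflicting(g_primary, g_aux):
    """Return g_aux unchanged, or projected if it conflicts with g_primary.

    Assumption: a conflict means a negative dot product (negative cosine
    similarity). The component of g_aux along g_primary is then removed,
    so the result is orthogonal to g_primary.
    """
    dot = float(g_aux.ravel() @ g_primary.ravel())
    if dot >= 0.0:
        return g_aux  # no conflict: keep the auxiliary gradient as-is
    sq_norm = float(g_primary.ravel() @ g_primary.ravel())
    return g_aux - (dot / sq_norm) * g_primary

def mgcm_step(primary_grads, aux_grads):
    """Combine per-module gradients, one module at a time.

    Both arguments are dicts mapping a (hypothetical) module name to an
    array. Summing the two gradients is an assumption of this sketch.
    """
    combined = {}
    for name, g_p in primary_grads.items():
        combined[name] = g_p + project_if_conflicting(g_p, aux_grads[name])
    return combined

# Toy example with two modules; the values are made up for illustration.
primary = {"attn": np.array([1.0, 0.0]), "ffn": np.array([0.0, 1.0])}
aux = {"attn": np.array([-1.0, 1.0]), "ffn": np.array([1.0, 1.0])}
update = mgcm_step(primary, aux)
```

In this toy case only the `attn` module conflicts, so only its auxiliary gradient is projected. Because each module is handled independently, no concatenated copy of all gradients is needed, which is the property the abstract links to lower GPU memory consumption.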

Paper Structure

This paper contains 17 sections, 4 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of gradient conflicts at different model granularities using cosine similarity. Coarse-grained model-level gradients show no conflicts (top). However, the finer-grained level reveals conflicts between the decoders (bottom), illustrating the concept of gradient conflict masking.
  • Figure 2: Gradient conflict detection and projection in MTL. (a) Conflicts are detected by calculating the cosine similarity between the primary and auxiliary task gradients, showing the states without and with conflicts. (b) A detected conflict is mitigated by projection.
  • Figure 3: We initially test a Transformer-based model (with a differentiable simultaneous strategy) across three scenarios: single-task, multi-task, and MGCM-enhanced multi-task learning. Subsequently, we employ the more robust DiSeg model as a baseline to further evaluate the MGCM method.
  • Figure 4: The performance of various gradient conflict mitigation methods, with all models using the DiSeg model as the baseline.
  • Figure 5: The probability of conflicts between SimulST and SimulASR at the component level (Attn, FFN, LN) across different Transformer layers. Layers 1-6 are the encoder layers, and layers 7-12 are the decoder layers.
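
The gradient conflict masking described in the Figure 1 caption can be reproduced with a toy example. The sketch below uses made-up two-dimensional gradients for a hypothetical "encoder" and "decoder" module; the numbers are chosen only to show the effect and are not taken from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-module gradients for the primary and auxiliary tasks.
primary = {"encoder": np.array([3.0, 0.0]), "decoder": np.array([1.0, 0.0])}
aux = {"encoder": np.array([3.0, 1.0]), "decoder": np.array([-1.0, 0.5])}

# Model-level view: concatenate every module's gradient into one vector.
model_level = cosine(
    np.concatenate([primary[name] for name in primary]),
    np.concatenate([aux[name] for name in primary]),
)

# Module-level view: compare the two tasks' gradients module by module.
module_level = {name: cosine(primary[name], aux[name]) for name in primary}

print("model-level cosine:", model_level)
for name, value in module_level.items():
    print(f"{name} cosine:", value)
```

Here the model-level similarity is positive because the large, well-aligned encoder gradients dominate the dot product, while the decoder similarity is negative. A model-level check would therefore report no conflict even though one module disagrees.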