Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation

Henglyu Liu, Andong Chen, Kehai Chen, Xuefeng Bai, Meizhi Zhong, Yuan Qiu, Min Zhang

TL;DR

This work addresses the limited semantic alignment between speech and text representations in LLM-based speech translation. It introduces Adaptive Inner Speech-Text Alignment (AI-STA), which uses optimal transport to measure and minimize representation discrepancies and employs cross-modal retrieval to select beneficial LLM layers for alignment, followed by joint training on those layers. Experiments on CoVoST2 demonstrate state-of-the-art gains in En→Zh and En→Ja and reveal strong correlations between alignment quality and translation performance, as well as notable improvements in zero-shot knowledge transfer to text MT. The findings underscore the importance of inner-layer cross-modal alignment in LLMs and offer a practical framework for enhancing cross-modal speech translation, while also outlining limitations and avenues for future theoretical grounding.

Abstract

Recent advances in large language models (LLMs) have led to significant breakthroughs across various tasks, laying the foundation for LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive Inner Speech-Text Alignment (AI-STA) method that bridges the modality gap by explicitly aligning speech and text representations at selected layers within LLMs. To achieve this, we leverage optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we use cross-modal retrieval to identify the layers best suited for alignment and perform joint training on these layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.
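To make the OT-based discrepancy measure more concrete, the following is a minimal sketch of an entropy-regularized OT (Sinkhorn) cost between a sequence of speech hidden states and a sequence of text hidden states. It is not the paper's implementation: the cosine-distance cost, the uniform token masses, the Sinkhorn solver, and the names `sinkhorn_distance`, `eps`, and `n_iters` are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn_distance(speech, text, eps=0.1, n_iters=200):
    """Entropy-regularized OT cost between two sets of hidden states.

    speech: (n, d) array of speech-side hidden states.
    text:   (m, d) array of text-side hidden states.
    Assumptions (not from the paper): cosine-distance cost, uniform masses.
    Returns (transport cost, transport plan of shape (n, m)).
    """
    # Cosine-distance cost matrix, clipped to [0, 2] against rounding error.
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    cost = np.clip(1.0 - s @ t.T, 0.0, 2.0)

    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over speech frames
    b = np.full(m, 1.0 / m)  # uniform mass over text tokens

    # Sinkhorn iterations: alternately rescale rows and columns of the kernel.
    K = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)

    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum()), plan
```

In a training setup, a differentiable version of such a cost (for example, written in an autodiff framework) could be added to the cross-entropy loss at the selected layers; the weighting between the two terms is not specified in this summary.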

Paper Structure

This paper contains 27 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Different training paradigms: Modality Conversion implicitly learns speech-text relationships from paired data, focusing on end-to-end mapping, while Modality Alignment explicitly enforces semantic consistency by aligning representations through supervised objectives.
  • Figure 2: Model Architecture of Our LSM.
  • Figure 3: Overview of the second and third stages of the proposed AI-STA. The left part first selects specific layers within the LLM according to their cross-modal retrieval ability. The right part then obtains hidden states by separately forwarding speech or transcribed text concatenated with the same prompts, and optimizes the LSM by combining an alignment loss (computed via Wasserstein distance between hidden states) with the cross-entropy loss.
  • Figure 4: Layer-wise trends of average mean reciprocal rank (MRR) in two distinct backbone LLMs for speech-to-text retrieval evaluation.
  • Figure 5: t-SNE visualization of speech and text representations from LSMs trained with or without the AI-STA method.
  • ...and 2 more figures
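
The layer-wise mean reciprocal rank (MRR) mentioned in the Figure 4 caption can be sketched as follows. This is an illustrative sketch only: it assumes each utterance is pooled into a single vector per layer, that speech item `i` is paired with text item `i`, and that retrieval uses cosine similarity. The function names `mean_reciprocal_rank` and `rank_layers_by_mrr` are hypothetical, and how the paper turns these scores into a final layer choice is not stated in this summary.

```python
import numpy as np

def mean_reciprocal_rank(speech_emb, text_emb):
    """Speech-to-text retrieval MRR for one layer.

    speech_emb, text_emb: (n, d) arrays; row i of each is a matched pair.
    Assumption: similarity is cosine similarity between pooled vectors.
    """
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = s @ t.T  # sim[i, j]: speech i vs. text j

    # Rank of the paired text = 1 + number of texts scoring strictly higher.
    paired = np.diag(sim)
    ranks = 1 + (sim > paired[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())

def rank_layers_by_mrr(per_layer_speech, per_layer_text):
    """Return (layer indices sorted by descending MRR, per-layer MRR scores)."""
    scores = [
        mean_reciprocal_rank(s, t)
        for s, t in zip(per_layer_speech, per_layer_text)
    ]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores
```

Plotting `scores` against the layer index would give a curve of the kind the Figure 4 caption describes.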