Chain-of-Thought Prompting for Speech Translation

Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

TL;DR

This work introduces chain-of-thought prompting for a Speech-LLM built on an encoder–decoder Megatron-T5, leveraging ASR transcripts as prompts to guide automatic speech translation. The two-step CoT process first decodes speech to ASR transcripts and then prompts the LLM with both the transcripts and encoded speech, using LoRA for efficient adaptation. Across six language directions on FLEURS using Canary AST data, the method achieves an average BLEU improvement of $2.4$ over a speech-prompt baseline, with even larger gains when ASR transcripts are closer to ground truth ($2.7$ BLEU). The approach outperforms a CoT-prediction baseline, large multitask models, and cascade systems, demonstrating the practical impact of ASR-informed prompting in encoder–decoder LLMs for cross-lingual speech translation.

Abstract

Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLMs that perform strongly in automatic speech recognition (ASR) and automatic speech translation (AST). In this work, we propose a novel approach that leverages ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM consists of a speech encoder and an encoder-decoder Megatron-T5. By first decoding speech to generate ASR transcripts and then using these transcripts together with the encoded speech for prompting, we guide speech translation in a two-step process akin to chain-of-thought (CoT) prompting. The T5 LLM is adapted with low-rank adaptation (LoRA), which outperforms full-model fine-tuning. Experimental results show that the proposed CoT prompting significantly improves AST performance, achieving an average increase of 2.4 BLEU points across 6 En->X or X->En AST tasks compared to speech prompting alone. Additionally, compared to a related CoT prediction method that predicts a concatenated sequence of ASR and AST transcripts, our method performs better by an average of 2 BLEU points.
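
To make the LoRA step mentioned in the abstract concrete, here is a minimal sketch of the standard LoRA formulation: a frozen weight W receives a trainable low-rank update, giving an effective weight W + (alpha / r) * B A. The class name `LoRALinear`, the rank, the scaling value, and the initialization are illustrative assumptions; the abstract does not state the rank, the scaling, or which Megatron-T5 layers are adapted.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration only; not the paper's implementation.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen during adaptation
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts small and random, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a single input vector of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the parameters in A and B, the only ones that would be trained."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
```

For a 4 x 4 weight with rank 1, this trains 8 parameters instead of 16; the saving grows with layer size, which is the usual motivation for LoRA over full fine-tuning.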

Paper Structure

This paper contains 12 sections, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Diagram of the proposed chain-of-thought (CoT) prompting model. The fixed text prompt, ASR text hypotheses, and speech encodings are concatenated into a single sequence that prompts Megatron-T5 for translation.
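
The two-step flow described in the Figure 1 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' code: `embed_text`, `build_cot_prompt`, and `translate_with_cot` are hypothetical names, the embedding is a placeholder, the ASR decoder, speech encoder, and LLM are passed in as stand-in callables, and the concatenation order simply follows the order in which the caption lists the three parts.

```python
def embed_text(tokens, dim):
    """Hypothetical stand-in for the LLM text embedding: one dim-sized vector per token."""
    return [[float(len(tok))] * dim for tok in tokens]


def build_cot_prompt(text_prompt, asr_hypothesis, speech_encodings):
    """Concatenate prompt embeddings, ASR-hypothesis embeddings, and speech encodings.

    The order (text prompt, ASR text, speech) is an assumption based on the caption.
    """
    dim = len(speech_encodings[0])
    prompt_emb = embed_text(text_prompt.split(), dim)
    asr_emb = embed_text(asr_hypothesis.split(), dim)
    return prompt_emb + asr_emb + speech_encodings


def translate_with_cot(audio, text_prompt, asr_decode, speech_encode, llm_generate):
    """Two-step CoT prompting: decode an ASR transcript, then prompt with text and speech."""
    asr_hypothesis = asr_decode(audio)        # step 1: speech -> ASR transcript
    speech_encodings = speech_encode(audio)   # encoder output frames
    prompt = build_cot_prompt(text_prompt, asr_hypothesis, speech_encodings)
    return llm_generate(prompt)               # step 2: prompted translation
```

A usage example with placeholder components:

```python
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
output = translate_with_cot(
    audio="dummy-audio",
    text_prompt="Translate English to German",
    asr_decode=lambda audio: "hello world",
    speech_encode=lambda audio: frames,
    llm_generate=lambda prompt: f"<translation conditioned on {len(prompt)} vectors>",
)
print(output)
```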