BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

TL;DR

The paper addresses the challenge of building a streaming, multitask SpeechLLM by proposing BESTOW, a cross-attention-based backbone that unifies offline and streaming inference under a read-write policy. By introducing a two-layer cross-attention speech feature extractor that treats speech features as keys/values and text prompts as queries, BESTOW enables end-to-end optimization with reduced computational cost and strong knowledge transfer from LLMs to speech. Experimental results show state-of-the-art or competitive performance across ASR, AST, SQA, and unseen tasks on large-scale multilingual data, with the streaming variant BESTOW-S offering tunable latency-quality tradeoffs. The work provides a practical open-source solution that scales to large data and supports multitask capabilities in real time, potentially accelerating deployment of SpeechLLM systems in real-world applications.
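To make the "text prompts as queries, speech features as keys/values" idea concrete, the following is a minimal sketch of one single-head cross-attention step in plain Python. It is an illustration only, not the authors' implementation: the function names, the toy 2-dimensional vectors, and the omission of learned projections are all assumptions, and the summary does not say how the extracted features are fed into the LLM afterwards.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, speech_feats):
    """Single-head scaled dot-product cross-attention (hypothetical helper).

    text_queries: list of T vectors of dimension d (the queries).
    speech_feats: list of S vectors of dimension d (used as keys and values).
    Returns T vectors, each a weighted average of the speech features.
    Learned query/key/value projections are omitted for brevity.
    """
    d = len(text_queries[0])
    outputs = []
    for q in text_queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in speech_feats
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, speech_feats)) for j in range(d)]
        )
    return outputs

# Toy inputs (made up for illustration): 3 text positions, 5 speech frames.
text_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speech_feats = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7], [0.0, 0.3], [0.9, 0.2]]
speech_context = cross_attention(text_queries, speech_feats)
```

The output has one vector per text position rather than one per speech frame, so the sequence passed on from this step grows with the text length, not with the (typically much longer) speech length. This is one way to read the reduced-cost claim above.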

Abstract

Incorporating speech understanding capabilities into pretrained large language models has become a vital research direction (SpeechLLM). Previous architectures can be categorized as: i) GPT-style, which prepends speech prompts to the text prompts as a sequence of LLM inputs, like a decoder-only model; ii) T5-style, which introduces speech cross-attention to each layer of the pretrained LLM. We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially considering that the solution should generalize to multiple speech tasks. We reformulate streamable SpeechLLM as a read-write policy problem and unify offline and streaming research with the BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables streaming and multitask at scale (beyond ASR) at the same time. This streamable solution achieves very strong performance on a wide range of speech tasks (ASR, AST, SQA, unseen DynamicSuperb). It is end-to-end optimizable, with lower training/inference cost, and demonstrates LLM knowledge transferability to speech.
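As a rough illustration of what a read-write policy looks like, the sketch below generates the action sequence of a wait-k schedule, a well-known fixed policy from simultaneous translation. It is chosen here only as an example: this summary does not state which policy BESTOW-S actually uses, and the function name, the chunk/token counts, and the value of k are all hypothetical.

```python
def wait_k_schedule(num_speech_chunks, num_output_tokens, k):
    """Return the READ/WRITE action sequence of a wait-k policy.

    The agent stays k speech chunks ahead of the tokens it has written;
    once the input is exhausted it only writes. Illustrative helper,
    not taken from the paper.
    """
    actions = []
    read, written = 0, 0
    while written < num_output_tokens:
        if read < num_speech_chunks and read < written + k:
            actions.append("READ")   # consume one more speech chunk
            read += 1
        else:
            actions.append("WRITE")  # emit one output token
            written += 1
    return actions

# Streaming: start writing after 2 chunks, then alternate.
streaming = wait_k_schedule(num_speech_chunks=4, num_output_tokens=3, k=2)

# A k at least as large as the input length reads everything first,
# which is the offline case.
offline = wait_k_schedule(num_speech_chunks=4, num_output_tokens=3, k=10)
```

Under these assumptions a single parameter moves the same model between low-latency streaming and offline decoding, which is the sense in which a read-write formulation can cover both regimes.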

Paper Structure (17 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Reformulate streamable SpeechLLM as a read-write policy problem previously used in simultaneous translation, i.e. the LLM agents should start replying whenever they think they have gotten enough information.
  • Figure 2: (a) The SALM architecture (Chen et al., 2024), shown as an example of Speech-LLaMA. (b) The proposed BESTOW architecture, which uses cross-attention to extract task-relevant features, with the text as queries and the speech features as keys/values. Compared with the former, the proposed model has lower runtime complexity while achieving state-of-the-art performance on multiple tasks.
  • Figure 3: Latency-Quality tradeoff curves of streaming ASR with BESTOW-S-bidi.
  • Figure 4: Latency-Quality tradeoff curves of simultaneous speech translation with BESTOW-S-bidi.