Speech Retrieval-Augmented Generation without Automatic Speech Recognition

Do June Min, Karel Mundnich, Andy Lapastora, Erfan Soltanmohammadi, Srikanth Ronanki, Kyu Han

TL;DR

SpeechRAG is a framework for open-question answering over spoken data. It fine-tunes a pre-trained speech encoder into a speech adapter whose output is fed into a frozen large language model (LLM)-based retrieval model, and it outperforms cascaded text-based models when the transcripts have a high word error rate (WER).

Abstract

One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)-based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets show that direct speech retrieval does not degrade relative to the text-based baseline, and that it outperforms the cascaded systems using ASR. For generation, we use a speech language model (SLM) as a generator, conditioned on audio passages rather than transcripts. Without any fine-tuning of the SLM, this approach outperforms cascaded text-based models when the transcripts have a high WER.
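The alignment and retrieval idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the embeddings below are random stand-ins for the outputs of the frozen text retriever and the speech adapter, the cosine-distance loss is an assumed choice (the exact training objective is not given in this section), and the function names `distillation_loss` and `retrieve_top_k` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distillation_loss(speech_emb, text_emb):
    """Mean cosine distance between speech embeddings and teacher text embeddings.

    Assumption: cosine distance stands in for whatever objective the paper uses.
    """
    s = l2_normalize(speech_emb)
    t = l2_normalize(text_emb)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def retrieve_top_k(query_emb, passage_embs, k=3):
    """Rank speech-passage embeddings by cosine similarity to a text-query embedding."""
    q = l2_normalize(query_emb)
    p = l2_normalize(passage_embs)
    scores = p @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

rng = np.random.default_rng(0)
# Stand-in for frozen text-retriever embeddings of five passage transcripts.
text_embs = rng.normal(size=(5, 16))
# Stand-in for adapter outputs that are already well aligned with the text space.
speech_embs = text_embs + 0.05 * rng.normal(size=(5, 16))
# Stand-in for a text query whose embedding lies close to passage 2.
query_emb = text_embs[2] + 0.05 * rng.normal(size=16)

loss = distillation_loss(speech_embs, text_embs)
top_idx, top_scores = retrieve_top_k(query_emb, speech_embs, k=3)
print(f"distillation loss: {loss:.4f}")
print("top passages:", top_idx.tolist())
```

Because the speech embeddings live in the same space as the text embeddings, the text query can be scored against audio passages directly, with no transcript in the loop.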

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (a) Our speech retriever consists of an adapter that projects speech into the embedding space of the text retrieval model. During training, we distill from the text embedding of the speech transcript to refine our speech embedding. This allows us to leverage the frozen text retriever's capacity during a similarity search with the query. During testing, we use the text branch (Figure 1a, left) to embed text queries and the speech branch (Figure 1a, right) to embed speech passages. (b) Our SpeechRAG framework consists of a speech retriever and an SLM (Figure 1b, top) and operates directly on speech. In contrast, ASR-based cascaded baselines (Figure 1b, bottom) first transcribe the audio and then apply text-based RAG, which propagates ASR errors into retrieval and generation.
  • Figure 2: Retrieval performance comparison with injected noise. We inject Gaussian noise into the speech signals and compare the text-based retriever with our end-to-end retriever at different SNRs.
  • Figure 3: SpokenSQuAD generations of the fully cascaded model vs. our ASR-less SpeechRAG framework. A named-entity transcription error in the ASR step propagates to the generation step, while the SLM of SpeechRAG generates the named entity correctly.
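
The noise-injection setup mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes white Gaussian noise scaled to hit a target signal-to-noise ratio (SNR) in decibels; the paper's exact procedure and SNR values are not given in this section, and the helper names `add_gaussian_noise` and `measured_snr_db`, the sine-wave test signal, and the SNR grid are hypothetical.

```python
import numpy as np

def add_gaussian_noise(signal, snr_db, rng=None):
    """Add zero-mean Gaussian noise so the result has roughly the requested SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, treating (noisy - clean) as the noise component."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

# Hypothetical one-second 220 Hz tone at 16 kHz, standing in for a speech waveform.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)

rng = np.random.default_rng(0)
noisy_by_snr = {snr: add_gaussian_noise(clean, snr, rng) for snr in (0, 10, 20)}
for snr, noisy in noisy_by_snr.items():
    print(f"target {snr} dB, measured {measured_snr_db(clean, noisy):.2f} dB")
```

In an evaluation like the one in Figure 2, each noisy waveform would be passed both to the ASR-based pipeline and to the speech retriever, so the two can be compared at the same SNR.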