SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation

Chunyu Sun, Bingyu Liu, Zhichao Cui, Junhan Shi, Anbin Qi, Tian-hao Zhang, Dinghao Zhou, Lewei Lu

TL;DR

The paper tackles high latency and error propagation in speech RAG by introducing an end-to-end speech-text embedding framework that directly maps speech and text into a shared semantic space. It employs separate speech and text encoders with a common projection, and a two-stage training regime: alignment pre-training and retrieval fine-tuning with multi-task losses. Empirical results on CMTEB and additional datasets show the approach outperforms traditional ASR-based pipelines while halving latency and achieving notable retrieval accuracy gains, supported by robust data filtering and large-scale training. The work demonstrates a viable path for real-time, retrieval-augmented SLLMs and points to future directions involving discrete-token inputs for even more efficient speech-language understanding.

Abstract

Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language model (LLM) applications. However, when it comes to speech large language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates the need for intermediate text representations. Specifically, the framework includes separate speech and text encoders, followed by a shared scaling layer that maps both modalities into a common embedding space. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods. We also provide a theoretical analysis of the challenges inherent in end-to-end speech retrieval and introduce architectural principles for effective speech-to-document matching. Extensive experiments demonstrate the robustness of our approach across diverse acoustic conditions and speaker variations, paving the way for a new paradigm in multimodal SLLM retrieval systems.
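The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the shared layer is a single linear projection followed by L2 normalization and that documents are ranked by cosine similarity. The encoder outputs are hand-picked placeholder vectors, and the names `project`, `retrieve`, and `shared_weight` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def project(features, weight):
    """Apply the shared linear layer (rows of `weight` are output units),
    then L2-normalize so dot products equal cosine similarities."""
    out = [sum(w * x for w, x in zip(row, features)) for row in weight]
    return l2_normalize(out)


def retrieve(query_emb, doc_embs, top_k=1):
    """Rank documents by cosine similarity to the query embedding."""
    scores = [sum(q * d for q, d in zip(query_emb, doc)) for doc in doc_embs]
    ranked = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores


# Assumption: one projection shared by both modalities (3-dim in, 2-dim out).
shared_weight = [[1.0, 0.0, 0.5],
                 [0.0, 1.0, 0.5]]

# Placeholder outputs of a text encoder for three documents.
doc_features = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [1.0, 1.0, 0.0]]

# Placeholder output of a speech encoder for a spoken query that is
# semantically closest to document 1. No ASR transcript is involved.
speech_features = [0.1, 0.9, 0.1]

doc_embs = [project(f, shared_weight) for f in doc_features]
query_emb = project(speech_features, shared_weight)
top, scores = retrieve(query_emb, doc_embs)
print(top)
```

Because both modalities pass through the same projection, a spoken query can be compared with pre-computed document embeddings directly, which is where the latency saving over an ASR-then-retrieve pipeline would come from.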

Paper Structure

This paper contains 21 sections, 13 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: The traditional two-stage speech retrieval system and our proposed unified speech-text embedding-based retrieval system, where * indicates that some modules in the model can be shared.
  • Figure 2: Overview of our method. Stage 1 is alignment training and stage 2 is contrastive training.