Transferable speech-to-text large language model alignment module

Boyong Wu; Chao Yan; Haoran Pu

TL;DR

Empirical results show that modal alignment can be achieved with a one-layer module and a hundred hours of speech-text multitask corpus. The alignment subspace revealed by singular value decomposition (SVD) implies that the linear alignment subspace is sparse, which leaves room to concatenate other features, such as voiceprint or video, to extend the model to further modalities.

Abstract

By leveraging the power of Large Language Models (LLMs) and speech foundation models, state-of-the-art speech-text bimodal works can handle challenging tasks such as spoken translation (ST) and spoken question answering (SQA) together, with much simpler architectures. In this paper, we build on the Whisper encoder and the pre-trained Yi-6B. Empirical results show that modal alignment can be achieved with a one-layer module and a hundred hours of speech-text multitask corpus. We further swap Yi-6B for its human-preference-aligned version, Yi-6B-Chat, during inference, and find that the alignment capability carries over as well. In addition, the alignment subspace revealed by singular value decomposition (SVD) implies that the linear alignment subspace is sparse, which leaves room to concatenate other features, such as voiceprint or video, to extend the model to further modalities.
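One way to make the SVD observation concrete is to count how many singular directions of a weight matrix carry most of its spectral energy. The sketch below is only an illustration under stated assumptions: it uses NumPy, the helper `effective_rank` and the 99% energy threshold are hypothetical choices, and the matrices are random stand-ins rather than the trained alignment weights. It also reads "sparse" as "a few singular directions dominate", which is one possible interpretation of the abstract.

```python
import numpy as np


def effective_rank(weight, energy=0.99):
    """Smallest number of singular directions holding `energy` of the squared spectrum.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index whose cumulative energy reaches the threshold, as a count.
    return int(np.searchsorted(cumulative, energy) + 1)


rng = np.random.default_rng(0)

# Stand-in for a map that only uses a small subspace: rank 4 by construction.
low_rank = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 128))
# Stand-in for a map that spreads energy over all directions.
dense = rng.normal(size=(64, 128))

k_low = effective_rank(low_rank)
k_dense = effective_rank(dense)
print(k_low, k_dense)
```

A small count relative to the matrix width would suggest that most output directions are unused by the mapping, which is the kind of headroom the abstract refers to when it mentions concatenating further features.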

Paper Structure (16 sections, 2 equations, 2 figures, 3 tables)

This paper contains 16 sections, 2 equations, 2 figures, 3 tables.

Introduction
Approach and Experiment Setup
Model Architecture
Prompt design
Training strategy
Modal alignment
Extensibility of the alignment module
Alignment mapping feature analysis
Experiments setup
Experimental data
Parameter settings
Results and Analysis
Evaluation
Alignment module's transfer-ability across LLMs
Alignment Feature Analysis
...and 1 more sections

Figures (2)

Figure 1: An overview of our proposed speech-text bimodal architecture. The alignment module maps speech features into the text feature space. The speech encoder is kept frozen throughout. The LLM embedding layer extracts text features from the prompt. The speech and text features are concatenated to form the LLM's input.
Figure 2: Cases of speech and plain text input
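
The data flow described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the one-layer alignment module is a single affine map, that speech features are placed before the prompt embeddings, and it uses random NumPy arrays in place of real Whisper encoder outputs and Yi-6B embeddings. All names and sizes (`SPEECH_DIM`, `LLM_DIM`, `VOCAB`, `align`, `build_llm_input`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
SPEECH_DIM = 8   # width of the frozen speech encoder output
LLM_DIM = 16     # width of the LLM embedding space
VOCAB = 32       # size of the toy vocabulary


def align(speech_feats, weight, bias):
    """Assumed one-layer alignment module: a single affine map into the LLM space."""
    return speech_feats @ weight + bias


def build_llm_input(speech_feats, prompt_ids, weight, bias, embed_table):
    """Map speech features into the text space and concatenate with prompt embeddings."""
    speech_emb = align(speech_feats, weight, bias)   # (T_speech, LLM_DIM)
    text_emb = embed_table[prompt_ids]               # (T_text, LLM_DIM)
    # Assumed ordering: speech first, then prompt text.
    return np.concatenate([speech_emb, text_emb], axis=0)


# Random stand-ins for the frozen encoder output and the LLM embedding table.
speech_feats = rng.normal(size=(5, SPEECH_DIM))
embed_table = rng.normal(size=(VOCAB, LLM_DIM))
prompt_ids = np.array([3, 7, 1])

# Only these alignment parameters would be trained in this sketch.
weight = rng.normal(size=(SPEECH_DIM, LLM_DIM))
bias = np.zeros(LLM_DIM)

llm_input = build_llm_input(speech_feats, prompt_ids, weight, bias, embed_table)
print(llm_input.shape)  # (8, 16): 5 speech frames followed by 3 prompt tokens
```

Because the speech encoder and the LLM embeddings are treated as fixed inputs here, the only learnable pieces are `weight` and `bias`, which mirrors the idea of a small alignment module that can be reused when the LLM is swapped at inference time.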
