Seal: Advancing Speech Language Models to be Few-Shot Learners

Shuyu Lei; Lingen Liu; Jiaolong Yang; Yasen Jiao; Yuxiang Yang; Yushu Yang; Xiang Guo

TL;DR

Seal bridges a frozen speech encoder and a frozen language model with a trainable projector. The projector is trained with a KL-divergence loss that aligns the output distribution conditioned on speech, $P(o|s)$, with the output distribution conditioned on the transcript, $P(o|t)$, which effectively simulates induction heads. The model is pretrained on 6,600+ hours of speech from Common Voice and GigaSpeech and evaluated on FSC and SLURP, where Seal-Phi2 and Seal-Phi3 achieve robust few-shot performance and benefit from KATE-based in-context example selection. Results are consistent across the Phi-2 and Phi-3 language model backends. The work demonstrates that KL-based alignment can yield a data-efficient speech few-shot learner; the method reduces reliance on large ASR pipelines and shows potential for broader multimodal tasks.
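KATE-style selection picks in-context examples by nearest-neighbor retrieval in an embedding space. The sketch below illustrates that idea only; it assumes embeddings have already been computed by some encoder, and the function names, toy vectors, and example strings are hypothetical rather than taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_examples(query_vec, pool, k):
    """Return the k pool examples most similar to the query, most similar first.

    pool is a list of (embedding, example) pairs. Both the embeddings and
    the example payloads here are hypothetical placeholders.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [example for _, example in ranked[:k]]

# Toy pool of labeled utterances with made-up 2-d embeddings.
pool = [
    ([1.0, 0.0], "turn on the lights -> activate_lights"),
    ([0.0, 1.0], "play some jazz -> play_music"),
    ([0.9, 0.1], "switch the lamp on -> activate_lights"),
]
chosen = select_examples([1.0, 0.0], pool, k=2)
```

The selected examples would then be placed in the prompt ahead of the query; how Seal orders or formats them is not specified in this summary.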

Abstract

Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in the prompt, without requiring any additional training. To extend this capability to a multi-modal setting (i.e., speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method in which a Kullback-Leibler divergence loss is used to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness across different pre-trained language models.
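As a rough illustration of the alignment objective, the sketch below computes a KL divergence between a target next-token distribution (from the language model reading the transcript) and a predicted one (from the language model reading projected speech). The logits are made-up numbers over a 4-token vocabulary, and the direction KL(text || speech) is an assumption; the paper's exact loss is given in its Equation 1, which is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical next-token logits over a 4-token vocabulary.
text_logits = [2.0, 0.5, -1.0, 0.0]    # frozen LM conditioned on the transcript
speech_logits = [1.5, 0.8, -0.5, 0.1]  # frozen LM conditioned on projected speech

p_text = softmax(text_logits)      # target distribution, playing the role of P(o|t)
p_speech = softmax(speech_logits)  # predicted distribution, playing the role of P(o|s)

# Assumed direction: the text-side distribution is the target.
loss = kl_divergence(p_text, p_speech)
```

In training, only the projector's parameters would receive gradients from this loss, since both the speech encoder and the language model are frozen.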

Paper Structure (13 sections, 2 equations, 1 figure, 2 tables)

Introduction
Methodology
Speech encoder
Projector
Language model
Training method
Experiments
Pre-train Implementations
Details of available datasets
Few-shot learning on two speech understanding tasks
Discussion
Conclusion
Future work

Figures (1)

Figure 1: Overview of the Seal model. The model consists of three components: a frozen speech encoder, a trainable projector, and a frozen language model. The Kullback-Leibler divergence loss is used to train the projector, as shown in Equation 1. Specifically, $j+1$ duplicate transcripts, separated by '\n', are forwarded to produce the target distribution. This distribution then guides the Seal model to simulate the induction head as introduced in [induction].
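The text-side input described in the caption can be sketched as follows. This is a minimal illustration assuming the transcript is a plain string; the function name and sample transcript are hypothetical, and how the speech-side input is arranged is not specified in the caption.

```python
def build_target_prompt(transcript, j):
    """Join j + 1 copies of the transcript with newline separators."""
    if j < 0:
        raise ValueError("j must be non-negative")
    return "\n".join([transcript] * (j + 1))

# With j = 2 the frozen language model would read three identical lines.
prompt = build_target_prompt("turn on the lights", j=2)
```

Because each line repeats the previous one, a language model reading this prompt can predict later copies by copying earlier ones, which is the induction-head behavior the target distribution is meant to capture.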
