Seal: Advancing Speech Language Models to be Few-Shot Learners

Shuyu Lei, Lingen Liu, Jiaolong Yang, Yasen Jiao, Yuxiang Yang, Yushu Yang, Xiang Guo

TL;DR

Seal advances speech-language understanding by bridging a frozen speech encoder with a frozen LM through a trainable projector. The projector is trained with a KL-divergence loss that aligns $P(o|s)$ with $P(o|t)$, effectively simulating induction heads. Pretraining on 6,600+ hours from Common Voice and GigaSpeech is followed by evaluation on FSC and SLURP, where Seal-Phi2 and Seal-Phi3 achieve robust few-shot performance and benefit from KATE-based in-context example selection, with consistent results across the Phi-2 and Phi-3 language model backends. The work introduces the Seal architecture (frozen speech encoder, trainable projector, frozen language model) and demonstrates that KL-based alignment can enable a data-efficient speech few-shot learner. Overall, the method reduces reliance on large ASR pipelines and shows potential for broader multimodal tasks.

Abstract

Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in the prompt, without requiring any additional training. To extend this capability to a multi-modal setting (i.e., speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method in which a Kullback-Leibler divergence loss is used to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness across different pre-trained language models.
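
As a rough illustration of the alignment objective described above, the sketch below computes a KL divergence between two next-token distributions over a toy vocabulary. It assumes that $P(o|t)$ is the frozen language model's output distribution given the transcript and $P(o|s)$ is its output distribution given the projected speech features. The direction of the divergence, the logit values, and the helper names `softmax` and `kl_divergence` are all illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps guards against log(0); it is an implementation convenience only.
    """
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical next-token logits over a toy 4-token vocabulary.
text_logits = [2.0, 0.5, -1.0, 0.0]    # assumed: frozen LM fed the transcript t
speech_logits = [1.0, 1.0, -0.5, 0.2]  # assumed: frozen LM fed projected speech s

p_target = softmax(text_logits)    # plays the role of P(o|t)
p_speech = softmax(speech_logits)  # plays the role of P(o|s)

# Assumed direction: the text-conditioned distribution is the target.
loss = kl_divergence(p_target, p_speech)
print(f"KL loss: {loss:.4f}")
```

In an actual training loop, only the projector's parameters would receive gradients from this loss, since both the speech encoder and the language model are frozen.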

Paper Structure (13 sections, 2 equations, 1 figure, 2 tables)

Figures (1)

  • Figure 1: An overview of the Seal model. The Seal model consists of three components: a frozen speech encoder, a trainable projector, and a frozen language model. The Kullback-Leibler divergence loss, shown in Equation \ref{eq1}, is used to train the projector. Specifically, $j+1$ duplicate transcripts, separated by '\n', are forwarded to produce the target distribution. This distribution then guides the Seal model to simulate the induction head introduced in [induction].
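
The text-side input described in the caption can be sketched as follows. This is only an illustration of "$j+1$ duplicate transcripts, separated by '\n'": the function name `build_target_prompt` is hypothetical, and how $j$ is chosen or where the target distribution is read from the language model's output is not specified here.

```python
def build_target_prompt(transcript: str, j: int) -> str:
    """Return j + 1 copies of the transcript joined by newline characters.

    Hypothetical helper: it only reproduces the duplication described in the
    Figure 1 caption, not the paper's full data pipeline.
    """
    if j < 0:
        raise ValueError("j must be non-negative")
    return "\n".join([transcript] * (j + 1))


prompt = build_target_prompt("turn on the lights", j=2)
print(prompt)
```

With `j=2` the prompt contains three identical lines, so a language model that copies from earlier context (the behavior attributed to induction heads) has two prior occurrences to draw on when predicting the last one.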