Efficient Interleaved Speech Modeling through Knowledge Distillation

Mohammadmahdi Nouriborji, Morteza Rohanian

TL;DR

TinyWave tackles the challenge of deploying expressive speech generation under tight compute and latency constraints by introducing a layer-aligned distillation framework that transfers knowledge from a large SpiritLM-based teacher to a compact 2B student. By fine-tuning the teacher briefly on target-domain data and applying layer-to-layer alignment of hidden states, attention, and softened logits, TinyWave preserves prosody and expressive cues while enabling real-time on-device usage. Across NPS, StoryCloze, SALMon, and Libri-Light continuations, the 2B models approach teacher performance and outperform size-matched baselines, illustrating that careful distillation can deliver high-quality, expressive speech generation on commodity hardware. The work provides open-source training code and models to foster reproducibility and practical deployment in conversational agents and assistive technologies.

Abstract

Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.

Paper Structure

This paper contains 16 sections, 4 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Distillation framework: The student aligns with the teacher’s logits, intermediate states, and ground-truth labels, guided by a multi-part loss.
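The multi-part loss named in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the loss weights (`alpha`, `beta`, `gamma`), the temperature value, the `T^2` scaling of the soft-logit term, and the use of mean squared error for intermediate states are all assumptions chosen for illustration, and attention-map matching is left out. The sketch also assumes teacher and student hidden vectors have the same width; a real setup would likely need a projection between them.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_logit_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is a common convention for soft-target distillation;
    it is an assumption here, not something the source states.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return (temperature ** 2) * kl


def hidden_mse(teacher_hidden, student_hidden):
    """Mean squared error between two equal-length hidden-state vectors."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)


def cross_entropy(student_logits, label):
    """Cross-entropy of the student's prediction against a ground-truth label index."""
    return -math.log(softmax(student_logits)[label])


def distillation_loss(teacher_logits, student_logits,
                      teacher_hidden, student_hidden, label,
                      alpha=1.0, beta=1.0, gamma=1.0, temperature=2.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    ce = cross_entropy(student_logits, label)
    kl = soft_logit_kl(teacher_logits, student_logits, temperature)
    mse = hidden_mse(teacher_hidden, student_hidden)
    return alpha * ce + beta * kl + gamma * mse


if __name__ == "__main__":
    # Toy values for a single token position with a 3-entry vocabulary.
    loss = distillation_loss(
        teacher_logits=[2.0, 0.5, -1.0],
        student_logits=[1.5, 0.7, -0.5],
        teacher_hidden=[0.2, -0.1, 0.4],
        student_hidden=[0.1, 0.0, 0.5],
        label=0,
    )
    print(f"combined loss: {loss:.4f}")
```

When the student reproduces the teacher's logits and hidden states exactly, the two alignment terms vanish and only the ground-truth cross-entropy remains, which is one way to sanity-check such a loss before scaling it to full sequences and multiple layers.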