Prompting Large Language Models with Audio for General-Purpose Speech Summarization

Wonjune Kang, Deb Roy

TL;DR

This work addresses general-purpose speech summarization by leveraging the reasoning capabilities of large language models (LLMs) through an end-to-end pipeline that maps speech to LLM-compatible token representations. The audio encoder is trained with a modality-invariance objective, aided by next-token prediction and distillation losses, while the LLM remains frozen to preserve its capabilities. The approach enables speech summaries across arbitrary domains and styles via prompting, and demonstrates improvements over a cascade of ASR followed by LLM processing on benchmark data. The findings highlight the practical potential of multimodal prompting to extend LLMs into speech-centric tasks without task-specific retraining, offering a flexible path toward style-controlled, domain-general speech summarization. Future work could scale to larger LLMs and more advanced audio encoders to further boost performance.

Abstract

In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret. Using a dataset of paired speech-text data, the overall system is trained to generate consistent responses to prompts with the same semantic information regardless of the input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, enabling speech summarization by simply prompting the LLM. Unlike prior approaches, our method can summarize spoken content from any domain, and it can produce summaries in different styles by varying the LLM prompting strategy. Experiments demonstrate that our approach outperforms a cascade baseline of speech recognition followed by LLM text processing.
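
The training objective summarized above (matching the LLM's response to speech with its response to the equivalent text, using next-token prediction and distillation losses) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the direction of the KL term, and the mixing weight `alpha` are all assumptions made for illustration, and plain Python lists stand in for the logits a frozen LLM would produce.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(student_logits, target_ids):
    """Mean cross-entropy of the speech-conditioned predictions against the
    tokens of the response that the LLM gave to the text version of the prompt."""
    total = 0.0
    for logits, target in zip(student_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def distillation_loss(student_logits, teacher_logits):
    """Mean per-position KL(teacher || student).

    Teacher: LLM output distribution given the text prompt.
    Student: LLM output distribution given the audio encoder's embeddings.
    (Assumed direction; the source does not specify it here.)"""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        log_s = log_softmax(s)
        log_t = log_softmax(t)
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return total / len(student_logits)


def combined_loss(student_logits, teacher_logits, target_ids, alpha=0.5):
    """Hypothetical weighted sum of the two terms; alpha is an assumed knob."""
    return (alpha * next_token_loss(student_logits, target_ids)
            + (1.0 - alpha) * distillation_loss(student_logits, teacher_logits))
```

In a real system only the audio encoder's parameters would receive gradients from such a loss, since the LLM stays frozen; the sketch computes the loss value only.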

Paper Structure

This paper contains 15 sections, 4 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The audio encoder is trained to produce token embeddings such that the LLM's response to a spoken prompt matches its response to a text prompt with the same semantic content.
  • Figure 2: Speech summarization can be performed by prompting the LLM using any combination of text and speech inputs.