S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models

Yuanbo Fang, Haoze Sun, Jun Liu, Tao Zhang, Zenan Zhou, Weipeng Chen, Xiaofen Xing, Xiangmin Xu

TL;DR

S2SBench addresses the gap in evaluating intelligence degradation when moving from text to end-to-end speech LLMs by introducing a pairwise perplexity-based benchmark across sentence continuation and commonsense reasoning tasks. It constructs cross-modal datasets (text and audio) and validates a two-stage training regime on Baichuan-Audio to mitigate degradation, offering a practical framework for diagnosing and guiding speech-based LLM development. The work provides a structured methodology for cross-modal evaluation and reveals insights into training dynamics, modality gaps, and language-specific challenges, with implications for improving robust reasoning in speech-enabled models.

Abstract

End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
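The pairwise protocol described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so the sketch makes two assumptions: a pair counts as correct when the plausible sample has lower perplexity than the implausible one, and degradation is the drop in pairwise accuracy from text input to audio input. The function names (`perplexity`, `pairwise_accuracy`, `degradation`) and the toy log-probabilities are hypothetical and are not taken from the S2SBench code.

```python
import math
from typing import Iterable, Sequence, Tuple

# One pair = (token log-probs of the plausible sample,
#             token log-probs of the implausible sample).
Pair = Tuple[Sequence[float], Sequence[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pairwise_accuracy(pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the plausible sample has lower perplexity.

    Assumption: this is the 'correct' criterion; the paper may define it differently.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(perplexity(pos) < perplexity(neg) for pos, neg in pairs)
    return correct / len(pairs)


def degradation(text_pairs: Iterable[Pair], audio_pairs: Iterable[Pair]) -> float:
    """Assumed degradation measure: text accuracy minus audio accuracy."""
    return pairwise_accuracy(text_pairs) - pairwise_accuracy(audio_pairs)


# Toy, made-up log-probabilities for two pairs under each input modality.
text_pairs = [
    ([-0.1, -0.2], [-1.0, -1.2]),
    ([-0.3, -0.3], [-0.9, -0.8]),
]
audio_pairs = [
    ([-0.4, -0.5], [-1.0, -1.1]),
    ([-1.2, -1.0], [-0.7, -0.6]),  # model prefers the implausible sample here
]

text_acc = pairwise_accuracy(text_pairs)
audio_acc = pairwise_accuracy(audio_pairs)
gap = degradation(text_pairs, audio_pairs)
```

In practice the log-probabilities would come from scoring the same continuation under the same model, once with the prompt as text and once as audio; comparing the two accuracies on identical pairs is what isolates the effect of the input modality.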

Paper Structure

This paper contains 14 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Architectural types of end-to-end Speech LLMs: (a) Partial end-to-end, (b) Interleaved fully end-to-end, and (c) Parallel fully end-to-end.
  • Figure 2: Evaluation pipeline for assessing the intelligence capability of large language models. The model architecture and reasoning task are identical under both text and audio input conditions.
  • Figure 3: Speech-to-text with single-stage training.
  • Figure 4: Text-to-text with single-stage training.
  • Figure 5: Speech-to-text with two-stage training (Stage 1).
  • ...and 2 more figures