CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching

Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

TL;DR

CS3-Bench targets language alignment in Mandarin–English code-switching speech-to-speech LLMs. Across 7 models, it reveals pronounced degradation on knowledge queries (a relative drop of up to 66%) and misunderstandings in open-ended conversations. The authors propose data construction and training strategies for language alignment, notably Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation, combined with multilingual data and LoRA fine-tuning. Their approach raises knowledge accuracy from 25.14% to 46.13% and the open-ended understanding rate from 64.5% to 86.5%, narrowing the code-switching gap and reducing pronunciation errors in the secondary language. The benchmark and methods provide a practical framework for developing robust bilingual speech interaction systems, with applications in bilingual assistants and multilingual AI.

Abstract

The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find that existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data construction and training approaches to improve language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves knowledge accuracy from 25.14% to 46.13% and the open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.
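
The abstract reports one figure as a relative change (the drop of up to 66%) and the others as before/after scores. The following is a minimal sketch of how the two kinds of figure relate, assuming the usual definition of relative change, (new - reference) / reference. The excerpt does not state the paper's exact formula, and the function name `relative_change` is illustrative, not taken from the paper.

```python
def relative_change(reference: float, new: float) -> float:
    """Return (new - reference) / reference.

    Assumed definition: negative values are relative drops,
    positive values are relative gains.
    """
    if reference <= 0:
        raise ValueError("reference score must be positive")
    return (new - reference) / reference


# Before/after scores reported in the abstract (in percent).
reported = {
    "knowledge accuracy": (25.14, 46.13),
    "open-ended understanding rate": (64.5, 86.5),
}

for name, (before, after) in reported.items():
    absolute_gain = after - before  # difference in percentage points
    gain = relative_change(before, after)
    print(f"{name}: +{absolute_gain:.2f} points, {gain:.1%} relative gain")
```

Under this definition, the knowledge-accuracy improvement is about 21 percentage points in absolute terms and roughly 83% in relative terms, which is why the two kinds of figure should not be compared directly.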

Paper Structure

This paper contains 14 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The code-switching deterioration in conversations.
  • Figure 2: The methods for code-switching in interactive speech LLMs. a) The creation pipeline for the code-switching corpus. b) The model architecture of the VocalNet baseline. c) The training strategy with Chain of Recognition (CoR) and Keyword Highlighting (KH).