Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

Mengjie Zhao, Lianbo Liu, Yusuke Fujita, Hao Shi, Yuan Gao, Roman Koshkin, Yui Sudo

Abstract

SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.

Paper Structure (13 sections, 2 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overview of our speech-worthy alignment. SpeechLLM responses are aligned from written style (verbose, symbol-heavy) to spoken style (concise, conversational) for auditory comprehension and TTS synthesis. English translations at right; non-speech-worthy elements are highlighted in red.
  • Figure 2: Effect of DPO loss weight on LLM-as-Judge scores. Higher DPO weights improve SpokenElyza performance across all datasets. ALL: combination of all preference training datasets.
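The "DPO loss weight" in Figure 2 refers to how strongly the preference term contributes to training. As a rough illustration only, the sketch below computes the standard Direct Preference Optimization loss for one preference pair (a spoken-style "chosen" response versus a written-style "rejected" one) from sequence log-probabilities. The function names, the `beta` default, and in particular the additive combination in `weighted_objective` are assumptions made for this example; the paper's exact objective is not given in this section.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit reward margins: how far the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


def weighted_objective(base_loss: float, dpo: float, dpo_weight: float) -> float:
    """Hypothetical combination: a base loss plus a weighted DPO term.

    This additive form is an assumption for illustration, not the
    authors' stated formulation.
    """
    return base_loss + dpo_weight * dpo


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy favours the spoken-style (chosen) response: the loss drops.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

When the policy assigns relatively more probability to the chosen response than the reference does, the loss falls below log 2; a larger `dpo_weight` in this sketch simply scales how much that term matters relative to the base loss.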