Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset

Ryoma Suzuki; Zhiyang Qi; Michimasa Inaba

Abstract

To address the critical scarcity of high-quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large-scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, so no single model can consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi-LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high-quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of Multilingual KokoroChat was rigorously validated through human preference studies. These evaluations confirmed that the translations produced by our ensemble method were preferred over those from any individual state-of-the-art LLM. This strong preference confirms the superior quality of our method's outputs. Multilingual KokoroChat is available at https://github.com/UEC-InabaLab/MultilingualKokoroChat.
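
The two-stage flow described in the abstract (hypothesis generation, then integration by a single LLM) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt wording, and the choice to number candidates instead of naming their source models are all assumptions made for illustration, and each LLM is represented by a plain callable so that no particular API is implied.

```python
from typing import Callable, Dict

# Hypothetical type: any function that sends text to one LLM and returns its reply.
LLMCall = Callable[[str], str]


def build_refinement_prompt(source: str, hypotheses: Dict[str, str],
                            target_lang: str) -> str:
    """Assemble the integration prompt (wording is an assumption, not the paper's)."""
    lines = [
        f"Source (Japanese): {source}",
        f"Candidate {target_lang} translations:",
    ]
    # Candidates are numbered rather than labeled with model names
    # (an illustrative design choice, not taken from the paper).
    for index, text in enumerate(hypotheses.values(), start=1):
        lines.append(f"[{index}] {text}")
    lines.append(
        "Analyze the strengths and weaknesses of each candidate, "
        "then write one final translation."
    )
    return "\n".join(lines)


def ensemble_translate(source: str, translators: Dict[str, LLMCall],
                       refiner: LLMCall, target_lang: str = "English") -> str:
    """Stage 1: collect one hypothesis per model. Stage 2: one LLM integrates them."""
    hypotheses = {name: translate(source) for name, translate in translators.items()}
    prompt = build_refinement_prompt(source, hypotheses, target_lang)
    return refiner(prompt)
```

In use, each entry of `translators` would wrap a call to a different LLM, and `refiner` would wrap the single model that performs the analysis and synthesis step.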

Paper Structure (25 sections, 8 figures, 5 tables)

Introduction
Related Work
Psychological Counseling Datasets
LLM Ensemble
KokoroChat
Multi-LLM Ensemble Translation
Diverse Hypothesis Generation
Integration and Refinement
Experiments
Experimental Settings
Automated Evaluation
Results of Automated Evaluation
Human Evaluation
Results of Human Evaluation
Case Study
...and 10 more sections

Figures (8)

Figure 1: Proposed Multi-LLM Ensemble Translation Method
Figure 2: Human Evaluation Results for Japanese-to-English Translation
Figure 3: Human Evaluation Results for Japanese-to-Chinese Translation
Figure 4: Comparison of Japanese-to-English translations in a case where the proposed method was judged inferior to Grok. This judgment resulted from a contextual mismatch inherent in the randomized experimental design: the use of "pathetic" for internal consistency conflicted with the term "uncool" in the context randomly assigned to the evaluators.
Figure 5: Japanese-to-Chinese translation example 1. This demonstrates the analysis and synthesis process that produced a final translation highly preferred by human evaluators over any of the three single-LLM hypotheses.
...and 3 more figures
