Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean
Chanwoo Park, Suyoung Park, JiA Kang, Jongyeon Park, Sangho Kim, Hyunji M. Park, Sumin Bae, Mingyu Kang, Jaejin Lee
TL;DR
Ko-MuSR is the first Korean benchmark for long-context, multistep soft reasoning. It follows the MuSR framework and is designed to avoid training-data contamination. By synthesizing fully Korean narratives, reasoning chains, and human-verified questions, it enables systematic evaluation of both multilingual and Korean-specialized LLMs, as well as of prompting strategies. The study finds that multilingual LLMs often outperform Korean-specialized models on Korean reasoning tasks, and that carefully designed prompting (few-shot examples, CoT, and hints) can approach human-level accuracy, though small language models show inconsistent gains. The benchmark thus provides a foundation for advancing Korean NLP, offering insights into cross-lingual reasoning transfer and into prompting strategies that improve long-context reasoning across languages.
Abstract
We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models, two multilingual and two Korean-specialized, show that multilingual models outperform Korean-focused ones even on Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
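As a rough illustration of how a prompt combining few-shot examples, reasoning traces, and a task-specific hint might be assembled for a multiple-choice item, consider the minimal sketch below. It is not the authors' implementation: the `Example` fields, the English template strings, and the helper names `format_item`, `build_prompt`, and `accuracy` are all assumptions made for illustration, and an actual Ko-MuSR prompt would be written in Korean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical container for one multiple-choice item (not the benchmark's schema)."""
    narrative: str
    question: str
    choices: List[str]
    reasoning: str = ""  # worked reasoning trace, filled in only for few-shot examples
    answer: int = -1     # 1-based index of the correct choice, -1 if unknown


def format_item(ex: Example, with_solution: bool) -> str:
    """Render one item; few-shot items also show their reasoning and answer."""
    lines = [ex.narrative, "", f"Question: {ex.question}"]
    lines += [f"{i}. {choice}" for i, choice in enumerate(ex.choices, start=1)]
    if with_solution:
        lines += [f"Reasoning: {ex.reasoning}", f"Answer: {ex.answer}"]
    else:
        # Leave the reasoning open so the model writes its own chain of thought.
        lines.append("Reasoning:")
    return "\n".join(lines)


def build_prompt(target: Example, shots: List[Example], hint: str = "") -> str:
    """Concatenate an optional hint, the solved few-shot items, and the target item."""
    parts = []
    if hint:
        parts.append(f"Hint: {hint}")
    parts += [format_item(shot, with_solution=True) for shot in shots]
    parts.append(format_item(target, with_solution=False))
    return "\n\n".join(parts)


def accuracy(predictions: List[int], gold: List[int]) -> float:
    """Fraction of items whose predicted choice index matches the gold index."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Dropping the `shots` list or the `hint` argument yields the plainer prompting variants, which makes it straightforward to compare strategies on the same items.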
