Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean

Chanwoo Park; Suyoung Park; JiA Kang; Jongyeon Park; Sangho Kim; Hyunji M. Park; Sumin Bae; Mingyu Kang; Jaejin Lee

TL;DR

Ko-MuSR is the first Korean benchmark for long-context multistep soft reasoning. It follows the MuSR framework and is designed to minimize training data contamination. Its narratives, reasoning chains, and questions are synthesized entirely in Korean, and the questions are verified by human annotators, which enables systematic evaluation of LLMs and prompting strategies. The study finds that multilingual LLMs often outperform Korean-specialized models on Korean reasoning tasks, and that carefully designed prompting (few-shot examples, chain-of-thought reasoning, and hints) can approach human-level accuracy, though small language models show inconsistent gains. The benchmark thus provides a foundation for advancing Korean NLP, offering insights into cross-lingual transfer of reasoning ability and into prompting strategies that improve long-context reasoning across languages.

Abstract

We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models (two multilingual and two Korean-specialized) show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
