Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son, Hyunwoo Ko, Dasol Choi
TL;DR
The paper addresses the lack of culturally grounded, multi-step reasoning benchmarks for Korean by introducing HRMCR, a benchmark with two subsets (Date and Zodiac) that are generated algorithmically and ship with built-in stepwise solvers. It evaluates 20 LLMs and finds that models need roughly \(2 \cdot 10^{25}\) training FLOPs to achieve non-zero performance. Even SOTA models remain below 50% accuracy, and the sharp initial gains past the threshold may be explained by error accumulation across steps rather than true emergence. The work releases the question set publicly, but not the generation code, to curb contamination, and it introduces an automatic explanation generator to support error analysis at each reasoning step. Overall, HRMCR shows how difficult it is to integrate cultural knowledge into multi-step reasoning and urges careful interpretation of emergent behavior. The authors commit to updating the dataset so that it remains robust for future benchmarking.
Abstract
We introduce HRMCR (HAE-RAE Multi-Step Commonsense Reasoning), a benchmark designed to evaluate large language models' ability to perform multi-step reasoning in culturally specific contexts, focusing on Korean. The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps. Consistent with prior observations on emergent abilities, our experiments reveal that models trained with fewer than \(2 \cdot 10^{25}\) FLOPs fail to solve almost any question, showing near-zero performance. Beyond this threshold, performance improves sharply. State-of-the-art models (e.g., O1) still score under 50%, underscoring the difficulty of our tasks. Notably, stepwise analysis suggests the observed emergent behavior may stem from compounding errors across multiple steps rather than reflecting a genuinely new capability. We publicly release the benchmark and commit to regularly updating the dataset to prevent contamination.
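The compounding-error explanation can be illustrated with a minimal sketch. It assumes, purely for illustration, that a question is answered correctly only when every reasoning step is correct and that steps succeed independently with the same probability. The step count of 6 and the function name final_accuracy are hypothetical and are not taken from the benchmark.

```python
def final_accuracy(step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps
    with identical per-step accuracy (an illustrative assumption)."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must lie in [0, 1]")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return step_accuracy ** num_steps


if __name__ == "__main__":
    NUM_STEPS = 6  # hypothetical step count, not the benchmark's actual value
    for p in (0.30, 0.50, 0.70, 0.90, 0.95):
        # Per-step accuracy rises smoothly; end-to-end accuracy stays near
        # zero for a long time and then climbs steeply.
        print(f"per-step {p:.2f} -> end-to-end {final_accuracy(p, NUM_STEPS):.3f}")
```

Under these assumptions, a model that gets each step right half of the time solves fewer than 2% of six-step questions, so steady per-step improvement can look like a sudden jump when only final answers are scored.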
