Multi-Step Reasoning in Korean and the Emergent Mirage

Guijin Son, Hyunwoo Ko, Dasol Choi

TL;DR

The paper addresses the gap in culturally grounded, multi-step reasoning benchmarks for Korean by introducing HRMCR, a two-subset benchmark (Date and Zodiac) generated algorithmically with built-in stepwise solvers. It evaluates 20 LLMs, observing a compute-based threshold around $2 \cdot 10^{25}$ FLOPs to achieve non-zero performance, with SOTA models remaining below 50% accuracy and initial gains potentially explained by error accumulation rather than true emergence. The work provides a public question set (but not generation code) to curb contamination, and it introduces an automatic explanation generator to facilitate error analysis across reasoning steps. Overall, HRMCR reveals the difficulty of integrating cultural knowledge into multi-step reasoning and urges careful interpretation of emergent behavior, while committing to dataset updates to maintain robustness for future benchmarking.

Abstract

We introduce HRMCR (HAE-RAE Multi-Step Commonsense Reasoning), a benchmark designed to evaluate large language models' ability to perform multi-step reasoning in culturally specific contexts, focusing on Korean. The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps. Consistent with prior observations on emergent abilities, our experiments reveal that models trained on fewer than \(2 \cdot 10^{25}\) training FLOPs struggle to solve any questions, showing near-zero performance. Beyond this threshold, performance improves sharply. State-of-the-art models (e.g., O1) still score under 50%, underscoring the difficulty of our tasks. Notably, stepwise analysis suggests the observed emergent behavior may stem from compounding errors across multiple steps rather than reflecting a genuinely new capability. We publicly release the benchmark and commit to regularly updating the dataset to prevent contamination.
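
The compounding-error explanation can be illustrated with a small sketch. It is not the paper's analysis: it assumes every step succeeds independently with the same probability, and the step count of 6 is a hypothetical value chosen only for illustration. Under those assumptions, the final accuracy is the per-step accuracy raised to the number of steps, so a smooth rise in per-step accuracy can look like a sudden jump in end-to-end accuracy.

```python
def final_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps are correct.

    Assumption (not from the paper): steps are independent and share
    the same per-step accuracy.
    """
    return per_step_accuracy ** num_steps


# Hypothetical step count, used only for illustration.
NUM_STEPS = 6

for p in (0.5, 0.7, 0.9, 0.99):
    print(f"per-step accuracy {p:.2f} -> final accuracy {final_accuracy(p, NUM_STEPS):.3f}")
```

With these assumed numbers, a per-step accuracy of 0.5 leaves the final accuracy under 2%, while 0.9 already yields roughly 53%, which is the kind of sharp end-to-end transition the abstract cautions against reading as a new capability.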

Paper Structure (28 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The X-axis represents the training compute scale in ExaFLOPs ($10^{18}$ floating-point operations), calculated as $6 \times \text{\#parameters} \times \text{\#tokens}$ following Kaplan et al. (2020). The Y-axis indicates the performance of the models on HRMCR.
  • Figure 2: Example of generated questions in the HRMCR benchmark. The figure showcases generated questions (left) alongside their automatically generated solutions (right). The top panel represents the "date" subset, while the bottom corresponds to the "zodiac" subset. Questions are translated into English to enhance accessibility.
  • Figure 3: Breakdown of performance results for selected steps in the Zodiac subset. The green line represents the regression line, the blue points are the instances used for fitting the regression, and the red points represent the test set. The regression fails only at the last step. For the full results, see the appendix with additional results.
  • Figure 4: Prompt used for judging responses.
  • Figure 5: A full example including a sample question, gold solution, model response (by o1) and judgment (by GPT-4o).
  • ...and 2 more figures
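
The compute estimate in the Figure 1 caption can be sketched as follows. The formula ($6 \times \text{\#parameters} \times \text{\#tokens}$) and the $2 \cdot 10^{25}$ FLOPs threshold come from the text above; the function names and the two example model sizes are hypothetical and serve only to show the arithmetic.

```python
EXA = 1e18  # one ExaFLOP = 10^18 floating-point operations

# Threshold reported in the abstract, in raw FLOPs.
THRESHOLD_FLOPS = 2e25


def training_flops(num_parameters: float, num_tokens: float) -> float:
    """Approximate training compute as 6 * parameters * tokens."""
    return 6.0 * num_parameters * num_tokens


def to_exaflops(flops: float) -> float:
    """Convert raw FLOPs to ExaFLOPs, the unit used on the X-axis."""
    return flops / EXA


# Hypothetical models (parameter and token counts are illustrative only).
examples = {
    "small (70e9 params, 15e12 tokens)": (70e9, 15e12),
    "large (400e9 params, 15e12 tokens)": (400e9, 15e12),
}

for name, (params, tokens) in examples.items():
    flops = training_flops(params, tokens)
    side = "above" if flops >= THRESHOLD_FLOPS else "below"
    print(f"{name}: {to_exaflops(flops):.3g} ExaFLOPs, {side} the threshold")
```

Under this approximation, the hypothetical smaller model lands at about $6.3 \cdot 10^{24}$ FLOPs, below the threshold, and the larger one at about $3.6 \cdot 10^{25}$ FLOPs, above it.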