Studying the Korean Word-Chain Game with RLVR: Mitigating Reward Conflicts via Curriculum Learning

Donghwan Rho

TL;DR

The paper analyzes reinforcement learning with verifiable rewards (RLVR) applied to the Korean word-chain puzzle and reveals intrinsic conflicts between rule-derived rewards. It shows that naive RLVR with the full rule set fails to train effectively, whereas a curriculum-learning approach that combines data reordering with staged exposure to rule complexity mitigates these conflicts and improves learning. The authors demonstrate that a two-stage curriculum and targeted data sampling accelerate acquisition of the initial-sound rule, yielding higher win rates and longer, more accurate chains against a dictionary. This work highlights the feasibility and value of studying non-English puzzle tasks to advance reasoning capabilities in large language models, and it motivates broader cross-linguistic puzzle research with RLVR curriculum methods.

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training large language models (LLMs) with stronger reasoning abilities. It has also been applied to a variety of logic puzzles. In this work, we study the Korean word-chain game using RLVR. We show that rule-derived rewards can naturally conflict, and demonstrate through experiments that a curriculum-learning scheme mitigates these conflicts. Our findings motivate further studies of puzzle tasks in diverse languages.

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Accuracy by final-syllable category. For example, the leftmost graph shows the accuracy of model answers for words that end with "념". In this figure, '념', '력', '론', and '륙' are syllables to which the initial-sound rule applies. While the baseline either fails to learn the initial-sound rule or learns it too slowly, our methods ISR, DR, and OS (see Section \ref{sec:results_analysis}) accelerate its acquisition.
  • Figure 2: A word-chain game against the dictionary. The dictionary starts with a random word (e.g., "사랑"). This word is then sent to the model together with the prompt. The model predicts the next word ("낭만", applying the initial-sound rule). If the next word satisfies the rules, the dictionary chooses another noun, and the game continues until either the dictionary cannot supply a valid word or the model violates a rule. In this game, the model fails on its third answer, so the model turn is 2. Meanings of the Korean words: "사랑": love, "낭만": romance, "만두": dumpling, "두뇌": brain, "뇌우": thunderstorm, "주식": stock.
  • Figure 3: (Left) The average number of model turns and (right) the average win rate of the model in the word-chain game against the dictionary. Both metrics increase with the proposed methods.
  • Figure 4: Failure cases in the word-chain game against the dictionary. (a) The answer does not obey the initial-sound rule; (b) the answer starts with a syllable of the previous word other than its last one (Section \ref{sec:other_syl}); (c) the answer repeats one of the previous words; (d) the answer does not follow rule (i) in Section \ref{sec:korean_word_chain}; (e) the answer is not a noun.
  • Figure 5: The flow chart representing the classification of the failure modes.
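
The game loop described in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `initial_sound_variant`, `allowed_starts`, `is_valid_answer`, and `play_game` are hypothetical, the initial-sound rule is a simplified version (initial ㄹ or ㄴ before ㅑ, ㅕ, ㅖ, ㅛ, ㅠ, ㅣ becomes ㅇ, and any other initial ㄹ becomes ㄴ), and only the checks named in the captions are applied (chaining from the last syllable, no repeated words, membership in a noun list). The paper's full rule set, including rule (i), is not reproduced here, and the dictionary is assumed to use the same initial-sound rule as the model.

```python
import random

HANGUL_BASE = 0xAC00                      # code point of the first precomposed syllable
N_MEDIAL, N_FINAL = 21, 28                # vowels and final-consonant slots per initial
NIEUN, RIEUL, IEUNG = 2, 5, 11            # initial-consonant indices of ㄴ, ㄹ, ㅇ
Y_OR_I_VOWELS = {2, 6, 7, 12, 17, 20}     # vowel indices of ㅑ ㅕ ㅖ ㅛ ㅠ ㅣ


def initial_sound_variant(syllable):
    """Apply the simplified initial-sound rule; return the input if it does not apply."""
    idx = ord(syllable) - HANGUL_BASE
    if not 0 <= idx < 19 * N_MEDIAL * N_FINAL:
        return syllable                   # not a precomposed Hangul syllable
    initial, rest = divmod(idx, N_MEDIAL * N_FINAL)
    medial = rest // N_FINAL
    if initial in (NIEUN, RIEUL) and medial in Y_OR_I_VOWELS:
        initial = IEUNG                   # e.g. 력 -> 역, 념 -> 염
    elif initial == RIEUL:
        initial = NIEUN                   # e.g. 랑 -> 낭, 론 -> 논
    return chr(HANGUL_BASE + initial * N_MEDIAL * N_FINAL + rest)


def allowed_starts(prev_word):
    """First syllables that may follow prev_word (last syllable or its variant)."""
    last = prev_word[-1]
    return {last, initial_sound_variant(last)}


def is_valid_answer(prev_word, answer, used, nouns):
    """Check chaining, no repetition, and noun-list membership (assumed rule subset)."""
    return (bool(answer) and answer[0] in allowed_starts(prev_word)
            and answer not in used and answer in nouns)


def play_game(start_word, model_fn, nouns, rng=None):
    """Play model vs. dictionary; return (model_turns, model_won)."""
    used = {start_word}
    current, model_turns = start_word, 0
    while True:
        answer = model_fn(current)
        if not is_valid_answer(current, answer, used, nouns):
            return model_turns, False     # the model violated a rule
        used.add(answer)
        model_turns += 1
        candidates = sorted(w for w in nouns
                            if w not in used and w[0] in allowed_starts(answer))
        if not candidates:
            return model_turns, True      # the dictionary cannot continue
        # Deterministic pick unless a random.Random instance is supplied.
        current = rng.choice(candidates) if rng else candidates[0]
        used.add(current)


# Replay of the Figure 2 game with scripted model answers.
nouns = {"사랑", "낭만", "만두", "두뇌", "뇌우", "주식"}
scripted = iter(["낭만", "두뇌", "주식"])
turns, won = play_game("사랑", lambda prev: next(scripted), nouns)
print(turns, won)  # 2 False
```

Passing a `random.Random` instance as `rng` makes the dictionary choose randomly among valid continuations, which is closer to the caption's description; the sorted default only keeps the replay reproducible.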