HAMMER: Hamiltonian Curiosity Augmented Large Language Model Reinforcement

Ming Yang, Xiaofan Li, Zhiyuan Ma, Dengliang Shi, Jintao Du, Yu Cheng, Weiguo Zheng

TL;DR

HAMMER addresses instability in reinforcement learning for large language models with verifiable rewards by introducing a diversity-driven curriculum that orders training data via a minimum-semantic Hamiltonian cycle. It combines semantic embeddings derived from the backbone LLM with an η-greedy heuristic to construct a Hamiltonian Curiosity Order, promoting early exploration and smoother optimization. The authors provide learning-theoretic guarantees showing that diverse, early subsets preserve the optimal policy while tightening generalization bounds, and they show that the minimum semantic cycle corresponds to maximizing dataset diversity. Empirically, HAMMER consistently yields 3-4% average accuracy gains across math benchmarks (AIME, AMC, Olympiad) and across model scales, demonstrating improved sample efficiency and training stability in RLVR settings.

Abstract

Recent curriculum reinforcement learning for large language models (LLMs) typically relies on difficulty-based annotations for data filtering and ordering. However, such methods suffer from local optimization, where continual training on simple samples in the early steps can cause the policy to lose its capacity for exploration. We propose a novel scheme, Hamiltonian curiosity augmented large language model reinforcement (HAMMER), that transfers diversity metrics, commonly used in dataset evaluation, into the dynamic reinforcement learning procedure: training samples are ordered via a minimum-semantic Hamiltonian path so that the initial training retains more exploration. From the theoretical perspective of generalization bounds, diversity-driven ordering facilitates stable convergence. Empirical evaluations indicate that HAMMER stimulates model "curiosity" and consistently achieves a 3% to 4% average accuracy gain across diverse inference benchmarks.

Paper Structure

This paper contains 38 sections, 4 theorems, 22 equations, 7 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Given a subset $\mathcal{S} \subset \mathcal{X}$ of $n$ samples, let $\pi^*$ be the optimal policy on $\mathcal{X}$. There exists some $\gamma$ (e.g., $\gamma=2\rho$) such that $\pi^* \in \Pi_{\mathcal{S}}$.

Figures (7)

  • Figure 1: Overview of HAMMER. Given a dataset $\mathcal{X}=\{x_i\}_{i=1}^n$, forward propagation through the backbone model yields sentence embeddings $\{e_i\}_{i=1}^n$, where similar samples lie closer together in the embedding space $\mathcal{E}$ and have larger similarity $\delta$ (e.g., $x_2, x_4$). The pairwise similarities $\{\delta(e_i, e_j)\}_{n \times n}$ define a complete graph. All paths of the graph constitute the Order Space $\mathcal{H}$. The path $\mathcal{P}^* \in \mathcal{H}$ with minimum similarity provides the Hamiltonian Curiosity Order.
  • Figure 2: Validation of pass@k over steps on Qwen3-1.7B DAPO (8192 context).
  • Figure 3: Validation of pass@k over steps on Qwen3-1.7B GRPO (8192 context).
  • Figure 4: Data order and batch size ablation study, where DAPO-MAX denotes the maximum-semantic-similarity data order, and DAPO-E2H and DAPO-H2E denote the "easy-to-hard" and "hard-to-easy" data orders.
  • Figure 5: Distribution of metrics.
  • ...and 2 more figures
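
The pipeline described in the Figure 1 caption (embeddings, pairwise similarities, a low-similarity path over the complete graph) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, cosine similarity is assumed for $\delta$, the toy embeddings stand in for backbone-LLM sentence embeddings, and the reading of "η-greedy" (with probability `eta`, pick a random unvisited sample; otherwise pick the least similar one) is an assumption, since the summary does not spell the heuristic out.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between row-vector embeddings (assumed delta)."""
    e = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(e, axis=1, keepdims=True)
    unit = e / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return unit @ unit.T


def greedy_curiosity_order(sim, eta=0.0, start=0, seed=0):
    """Greedy approximation of a minimum-similarity Hamiltonian path.

    Assumed eta-greedy rule: with probability `eta` the next sample is drawn
    uniformly from the unvisited ones; otherwise the unvisited sample that is
    least similar to the current one is chosen.
    """
    rng = np.random.default_rng(seed)
    n = sim.shape[0]
    order = [start]
    unvisited = set(range(n)) - {start}
    while unvisited:
        candidates = sorted(unvisited)
        if rng.random() < eta:
            nxt = int(rng.choice(candidates))
        else:
            cur = order[-1]
            nxt = min(candidates, key=lambda j: sim[cur, j])
        order.append(nxt)
        unvisited.remove(nxt)
    return order


def path_similarity(sim, order):
    """Total similarity accumulated along consecutive samples in `order`."""
    return float(sum(sim[a, b] for a, b in zip(order, order[1:])))


if __name__ == "__main__":
    # Toy 2-D embeddings: samples 0 and 1 are near-duplicates, as are 2 and 3.
    toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    sim = cosine_similarity_matrix(toy)
    order = greedy_curiosity_order(sim, eta=0.0)
    print(order, path_similarity(sim, order))
```

On the toy data the greedy order alternates between the two clusters, so consecutive training samples are dissimilar. The exact minimum-similarity Hamiltonian path is a travelling-salesman-style problem, which is why a greedy heuristic is the natural choice for a sketch like this.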

Theorems & Definitions (20)

  • Definition 1: Sentence Embedding Space
  • Example 1
  • Definition 2: Order Space
  • Example 2
  • Definition 3: Hamiltonian Curiosity Order
  • Example 3
  • Definition 4: Optimal Policy
  • Definition 5: Induced Policy Subset
  • Definition 6: Generalization Error
  • Definition 7: Diversity Metric
  • ...and 10 more
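
Definition 3 is listed here by title only. Based on the Figure 1 caption, one plausible way to write the Hamiltonian Curiosity Order is as the path in the Order Space with minimum total similarity between consecutive samples. The indexing $\mathcal{P}(t)$ for the $t$-th sample in the order is notation assumed for this sketch, not taken from the paper:

```latex
\mathcal{P}^* \;=\; \arg\min_{\mathcal{P} \in \mathcal{H}} \; \sum_{t=1}^{n-1} \delta\big(e_{\mathcal{P}(t)},\, e_{\mathcal{P}(t+1)}\big)
```

If the order is treated as a cycle, as in the TL;DR, the sum would presumably also include the closing term $\delta\big(e_{\mathcal{P}(n)}, e_{\mathcal{P}(1)}\big)$.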