MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

Abstract

Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. Scarce reward labels therefore limit the effectiveness of reinforcement learning fine-tuning. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and Qwen2.5-1.5B across mathematics, question answering, and code generation demonstrate that MemReward, with only 20% labels, achieves 97.3% of Oracle performance on the 3B model and 96.6% on the 1.5B model, and surpasses Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.
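
To make the graph described in the abstract concrete, here is a minimal sketch of a heterogeneous experience-memory graph with query, thinking, and answer nodes. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_graph`, the edge-type names, the use of cosine similarity over toy embeddings, and the choice of `k` are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(queries, rollouts, k=1):
    """Build a toy heterogeneous graph.

    queries:  {query_id: embedding}               (embeddings are assumed given)
    rollouts: [(query_id, thinking, answer), ...]  one entry per rollout
    Returns (nodes, edges): nodes maps (type, id) -> payload,
    edges maps an edge-type name -> list of (src, dst) node keys.
    """
    nodes = {}
    edges = defaultdict(list)

    # One node per query.
    for qid, emb in queries.items():
        nodes[("query", qid)] = emb

    # Structural edges: each query links to its thinking and answer nodes.
    for i, (qid, thinking, answer) in enumerate(rollouts):
        nodes[("thinking", i)] = thinking
        nodes[("answer", i)] = answer
        edges["query-thinking"].append((("query", qid), ("thinking", i)))
        edges["query-answer"].append((("query", qid), ("answer", i)))

    # Similarity edges: each query links to its k most similar other queries.
    for qid, emb in queries.items():
        scored = sorted(
            ((cosine(emb, other), oid) for oid, other in queries.items() if oid != qid),
            reverse=True,
        )
        for _, oid in scored[:k]:
            edges["query-query"].append((("query", qid), ("query", oid)))

    return nodes, edges
```

A GNN would then be trained on the labeled part of such a graph; that training step is outside the scope of this sketch.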

Paper Structure (24 sections, 7 equations, 6 figures, 15 tables, 1 algorithm)

Figures (6)

  • Figure 1: MemReward approaches Oracle performance with only 20% labels. Using the same 20% ground-truth labels, MemReward (purple) substantially outperforms partial labels (R1-p, gray), approaching fully-supervised Oracle performance (green) on in-domain tasks and surpassing it on out-of-domain tasks across both model scales.
  • Figure 2: Overview of MemReward. Rollouts generated by the initial policy are stored as experience memory and organized into a heterogeneous graph for reward prediction. (Left) Warmup Phase: We construct a heterogeneous graph from labeled queries, where query nodes connect via embedding similarity, and each query links to its thinking and answer nodes. A GNN is trained to predict rewards through relational message passing. (Right) Online Phase: During GRPO training, labeled queries receive ground-truth rewards while unlabeled queries connect to the warmup graph via top-$k$ similarity edges and obtain GNN-predicted rewards.
  • Figure 3: Ablation studies on (a) Qwen2.5-3B and (b) Qwen2.5-1.5B show each architectural component contributes to performance. The full model consistently outperforms all ablated variants on both scales across all three task categories.
  • Figure 4: MemReward consistently improves over R1-p across all 13 benchmarks on Qwen2.5-1.5B, with the largest gains on mathematical reasoning (GSM-Sym +14.9, GSM8K +11.6) and the smallest on well-saturated tasks (MBPP+ 0.0).
  • Figure 5: MemReward performance scales with ground-truth label ratio on Qwen2.5-3B. Each bar shows the overall average score. Even at 20% GT, MemReward reaches 97.3% of Oracle.
  • ...and 1 more figure
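
The online phase described in the Figure 2 caption (ground-truth rewards for labeled queries, predicted rewards for unlabeled queries connected through top-$k$ similarity edges) can be sketched roughly as follows. This is a simplified illustration only: the paper uses a trained GNN, whereas this sketch substitutes a similarity-weighted average of neighbor rewards as a stand-in predictor. The function name `assign_reward`, the memory layout, and the default `k` are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_reward(query_emb, label, memory, k=2):
    """Return the reward used for one rollout during online training.

    label:  ground-truth reward if the query is labeled, else None.
    memory: [(embedding, reward), ...] from labeled warmup queries.
    """
    # Labeled queries keep their ground-truth reward.
    if label is not None:
        return float(label)

    # Unlabeled queries attach to their top-k most similar memory entries.
    neighbors = sorted(
        ((cosine(query_emb, emb), reward) for emb, reward in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Stand-in for the GNN: similarity-weighted average of neighbor rewards.
    weights = [max(sim, 0.0) for sim, _ in neighbors]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * reward for w, (_, reward) in zip(weights, neighbors)) / total
```

For example, with `memory = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]`, an unlabeled query embedded at `[1.0, 0.0]` with `k=1` inherits the reward of its single nearest neighbor, while a labeled query bypasses the memory entirely.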