Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch

Xueru Wen, Jie Lou, Zichao Li, Yaojie Lu, Xing Yu, Yuqiu Ji, Guohai Xu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Debing Zhang

TL;DR

This work introduces CheemsBench, a large-scale, fully human-annotated benchmark for Chinese reward models, and CheemsPreference, a diverse Chinese preference dataset built with distant supervision to support RM training. By combining a multi-response, human-centric evaluation with a graph-based conflict-resolution mechanism, CheemsBench provides accurate alignment signals that correlate strongly with downstream tasks. The authors demonstrate that human supervision yields state-of-the-art performance on CheemsBench, and show that AI-generated data alone cannot fully capture human preferences, underscoring the value of high-quality human data. Together, CheemsBench and CheemsPreference establish a practical foundation for advancing Chinese RM research and highlight the continued need for human-in-the-loop approaches in RM development.

Abstract

Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.

Paper Structure

This paper contains 36 sections, 4 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: The differences in construction and usage between CheemsBench and the existing RM resources.
  • Figure 2: Chinese RM benchmark construction process. We utilize open-source prompts and human instructions and sample five responses from various models for each prompt. These responses then undergo five rounds of triple-wise manual comparisons. Unique partial rankings are generated by a conflict-resolving algorithm.
  • Figure 3: Chinese preference dataset construction process. Each prompt’s different responses and their annotation results form a directed graph. Cycles in this preference graph indicate conflicts. We utilize the reward model trained on the human-annotated dataset to filter GPT annotations, thereby producing a directed acyclic graph.
  • Figure 4: Accuracy of top-ranked reward models on CheemsBench across subsets of different categories. The left and right sub-figures respectively show the results on open-source prompts and human instructions.
  • Figure 5: Correlations between different RM benchmarks and performance on three downstream tasks.
  • ...and 10 more figures
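
The captions of Figures 2 and 3 describe turning triple-wise comparisons into a directed preference graph and resolving conflicts into a partial ranking. The paper's actual conflict-resolving algorithm is not given in this excerpt, so the following is only a minimal sketch under one assumed rule: responses that lie on a common cycle are merged into a single tied group, and the resulting acyclic graph is layered from best to worst. The function names (`build_graph`, `strongly_connected_components`, `partial_ranking`) and the tie-merging rule are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(rankings):
    """Build a directed preference graph from best-to-worst orderings.

    Each ranking is a tuple of response ids (for example a triple);
    an edge u -> v means "u was preferred over v" in some comparison.
    """
    nodes, edges = set(), defaultdict(set)
    for ranking in rankings:
        nodes.update(ranking)
        for better, worse in combinations(ranking, 2):
            edges[better].add(worse)
    return nodes, edges


def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm (recursive; fine for a handful of responses)."""
    index, low, stack, on_stack, comps = {}, {}, [], set(), []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in edges.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            comps.append(frozenset(comp))

    for v in sorted(nodes):
        if v not in index:
            visit(v)
    return comps


def partial_ranking(rankings):
    """Return tiers (best first); each tier is a list of tied groups.

    ASSUMPTION: responses on a common cycle (a conflict) are merged
    into one tied group. This is an illustrative rule, not necessarily
    the one used to build CheemsBench.
    """
    nodes, edges = build_graph(rankings)
    comps = strongly_connected_components(nodes, edges)
    comp_of = {v: c for c in comps for v in c}

    # Condense the graph: one node per tied group, edges between groups.
    succ = {c: set() for c in comps}
    indeg = {c: 0 for c in comps}
    for u in nodes:
        for v in edges.get(u, ()):
            cu, cv = comp_of[u], comp_of[v]
            if cu != cv and cv not in succ[cu]:
                succ[cu].add(cv)
                indeg[cv] += 1

    # Kahn-style layering of the condensed DAG.
    tiers = []
    frontier = [c for c in comps if indeg[c] == 0]
    while frontier:
        tiers.append(sorted(sorted(c) for c in frontier))
        nxt = []
        for c in frontier:
            for d in succ[c]:
                indeg[d] -= 1
                if indeg[d] == 0:
                    nxt.append(d)
        frontier = nxt
    return tiers


if __name__ == "__main__":
    # Two annotators disagree on A, B, C; a third round adds D and E.
    rounds = [("A", "B", "C"), ("B", "C", "A"), ("A", "D", "E")]
    print(partial_ranking(rounds))
```

In this sketch, groups in the same tier are incomparable, while members of the same group are treated as tied because the annotations conflict. Other resolutions (for example, dropping the lowest-confidence edge in each cycle, or filtering edges with a trained reward model as Figure 3 describes for GPT annotations) would fit the same graph representation.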