Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment

Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang

TL;DR

CARB introduces a multilingual, culture-aware benchmark for reward models (RMs) that evaluates 10 cultures across 4 cultural domains using Best-of-N tasks. The study reveals that state-of-the-art generative RMs generally outperform classifier-based ones in multilingual cultural alignment, while also uncovering spurious correlations that misalign with human preferences. It demonstrates a strong positive relationship between CARB performance and downstream multilingual cultural alignment, and shows robustness gaps in cross-lingual scoring. To address these issues, the authors propose Think-as-Locals with reinforcement learning from verifiable rewards (RLVR), a structured, rubric-driven approach that reduces reliance on surface cues and improves culturally grounded judgments. Overall, CARB provides a critical tool for efficient RM selection and culture-aware optimization of multilingual LLMs, with Think-as-Locals offering a practical path to more robust cultural alignment.
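
As a rough illustration of the Best-of-N setup mentioned above, the sketch below counts an example as correct when the reward model gives its highest score to the preferred response among N candidates. The field names (`prompt`, `responses`, `best_index`), the `score` callable, and the toy length-based scorer are assumptions made for this example; they are not taken from CARB itself.

```python
from typing import Callable, Sequence


def best_of_n_accuracy(
    examples: Sequence[dict],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the top-scored candidate is the preferred one."""
    hits = 0
    for ex in examples:
        # Score every candidate response for this prompt.
        scores = [score(ex["prompt"], r) for r in ex["responses"]]
        # Index of the highest-scoring candidate (first one wins on ties).
        top = max(range(len(scores)), key=scores.__getitem__)
        hits += int(top == ex["best_index"])
    return hits / len(examples)


def length_scorer(prompt: str, response: str) -> float:
    """Hypothetical scorer that rewards length only, i.e. a pure surface cue."""
    return float(len(response))


# Made-up toy data, for illustration only.
toy_examples = [
    {"prompt": "q1", "responses": ["short", "a much longer answer"], "best_index": 1},
    {"prompt": "q2", "responses": ["correct", "wrong but very verbose"], "best_index": 0},
]
toy_accuracy = best_of_n_accuracy(toy_examples, length_scorer)
print(toy_accuracy)
```

A scorer that relies on a surface feature such as length is right only when that feature happens to coincide with the preferred response, which is the kind of spurious correlation the summary describes.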

Abstract

Reward models (RMs) are crucial for aligning large language models (LLMs) with diverse cultures. Consequently, evaluating their cultural awareness is essential for advancing the global alignment of LLMs. However, existing RM evaluations fall short in assessing cultural awareness because culturally relevant evaluation datasets are scarce. To fill this gap, we propose the Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains. Our extensive evaluation of state-of-the-art RMs reveals their deficiencies in modeling cultural awareness and demonstrates a positive correlation between performance on CARB and downstream multilingual cultural alignment tasks. Further analysis identifies spurious correlations within culture-aware reward modeling, wherein RMs' scoring relies predominantly on surface-level features rather than an authentic understanding of cultural nuance. To address these issues, we propose Think-as-Locals, which elicits deeper culturally grounded reasoning from generative RMs via reinforcement learning from verifiable rewards (RLVR) and employs well-designed rewards to ensure accurate preference judgments and the generation of high-quality structured evaluation criteria. Experimental results validate its efficacy in mitigating interference from spurious features and advancing culture-aware reward modeling.
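
The abstract does not specify how the rewards are computed. Purely as a hypothetical sketch of what a verifiable reward for a generative RM could look like, the function below checks the model's stated preference against a gold label and adds a small bonus when the output contains a structured criteria section. The tag names (`<rubric>`, `<answer>`), the bonus weight, and the function name are all assumptions for illustration, not the authors' design.

```python
import re


def verifiable_reward(output: str, gold_choice: str, format_bonus: float = 0.2) -> float:
    """Hypothetical reward: correctness of the stated preference plus a format bonus."""
    # Assumed output format: "<rubric>...</rubric> <answer>A</answer>".
    match = re.search(r"<answer>\s*([A-Z])\s*</answer>", output)
    if match is None:
        # No parseable judgment, so nothing can be verified.
        return 0.0
    # Verifiable part: does the stated preference match the gold label?
    reward = 1.0 if match.group(1) == gold_choice else 0.0
    # Assumed structural check: a non-empty rubric section is present.
    if re.search(r"<rubric>.+?</rubric>", output, flags=re.DOTALL):
        reward += format_bonus
    return reward


sample = "<rubric>Respects local greeting customs</rubric> <answer>A</answer>"
print(verifiable_reward(sample, gold_choice="A"))
```

In this sketch the format bonus is granted even when the judgment is wrong; whether structure should be rewarded independently of correctness is a design choice the abstract leaves open.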

Paper Structure

This paper contains 58 sections, 8 equations, 30 figures, 19 tables.

Figures (30)

  • Figure 1: Overview of CARB. (a) An example from CARB and the Best-of-N evaluation paradigm; (b) Evaluation of reward modeling across cultural commonsense, values, safety, and linguistics.
  • Figure 2: Performance across three linguistic dimensions: resource availability, language family, and script. Resource availability categorization is based on Joshi et al. (2020), with higher-numbered classes having more data resources. Language family and script are based on Singh et al. (2024).
  • Figure 3: The performance of the top-3 classifier-based and generative RMs across domains.
  • Figure 4: Comparison of the correlation between the reward benchmark and alignment performance. The x-axis lists policy models used for BoN sampling.
  • Figure 5: The lines illustrate the linear relationship between downstream ratings and performance on reward benchmarks, with the coefficient of determination ($r^2$) indicating the strength of this linear correlation and the p-values ($p$) indicating statistical significance.
  • ...and 25 more figures
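
The coefficient of determination reported in the Figure 5 caption can be computed from a least-squares line fit. The sketch below is a minimal standard-library version; the function name and the input numbers are made up for illustration and are not results from the paper. For the p-value, a statistics library would normally be used (for example, `scipy.stats.linregress` returns both `rvalue` and `pvalue`).

```python
from typing import Sequence, Tuple


def linear_fit_r2(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares line y = slope * x + intercept and its r^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of squares and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, r^2 is the squared Pearson correlation.
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


# Placeholder values: benchmark scores (x) vs. downstream ratings (y).
benchmark_scores = [0.0, 1.0, 2.0, 3.0]
downstream_ratings = [1.0, 3.0, 2.0, 4.0]
slope, intercept, r2 = linear_fit_r2(benchmark_scores, downstream_ratings)
print(slope, intercept, r2)
```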