Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment

Hongbin Zhang; Kehai Chen; Xuefeng Bai; Yang Xiang; Min Zhang

TL;DR

CARB introduces a multilingual, culture-aware benchmark for reward models (RMs) that evaluates 10 cultures across 4 cultural domains using Best-of-N tasks. The study finds that state-of-the-art generative RMs generally outperform classifier-based RMs in multilingual cultural alignment. It also uncovers spurious correlations that cause RM judgments to diverge from human preferences. It demonstrates a strong positive correlation between CARB performance and downstream multilingual cultural alignment, and it reveals robustness gaps in cross-lingual scoring. To address these issues, the authors propose Think-as-Locals, a structured, rubric-driven approach trained with reinforcement learning from verifiable rewards (RLVR) that reduces reliance on surface cues and improves culturally grounded judgments. Overall, CARB provides a critical tool for efficient RM selection and for culture-aware optimization of multilingual LLMs, and Think-as-Locals offers a practical path to more robust cultural alignment.
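CARB scores RMs with Best-of-N tasks. One common way to score such a task counts an item as correct only when the RM ranks the preferred response strictly above every other candidate. That convention is an assumption here and is not confirmed by the text. The minimal sketch below computes that accuracy. The function name `best_of_n_accuracy`, the dictionary keys `prompt`, `chosen`, and `rejected`, and the toy data are all hypothetical. `score` stands in for any RM that maps a prompt and a response to a scalar.

```python
from typing import Callable, Sequence


def best_of_n_accuracy(
    examples: Sequence[dict],
    score: Callable[[str, str], float],
) -> float:
    """Share of items where the chosen response outscores every rejected one."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = 0
    for ex in examples:
        chosen = score(ex["prompt"], ex["chosen"])
        rejected = [score(ex["prompt"], r) for r in ex["rejected"]]
        # Strict comparison: a tie with any rejected response counts as a miss.
        if all(chosen > s for s in rejected):
            correct += 1
    return correct / len(examples)


def length_scorer(prompt: str, response: str) -> float:
    """Naive stand-in RM that prefers longer responses (a surface-level cue)."""
    return float(len(response))


# Hypothetical toy items: one chosen response and N-1 rejected ones per prompt.
toy = [
    {"prompt": "p1", "chosen": "a concise, culturally apt answer", "rejected": ["short", "x"]},
    {"prompt": "p2", "chosen": "brief", "rejected": ["a much longer but culturally off answer"]},
]
print(best_of_n_accuracy(toy, length_scorer))  # 0.5
```

The toy `length_scorer` is deliberately naive. It rewards length regardless of cultural content, which is the kind of surface-level cue the authors warn about. Because ties count as failures, an RM that gives every candidate the same score cannot earn credit.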

Abstract

Reward models (RMs) are crucial for aligning large language models (LLMs) with diverse cultures. Consequently, evaluating their cultural awareness is essential for further advancing the global alignment of LLMs. However, existing RM evaluations fall short in assessing cultural awareness due to the scarcity of culturally relevant evaluation datasets. To fill this gap, we propose the Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains. Our extensive evaluation of state-of-the-art RMs reveals their deficiencies in modeling cultural awareness. It also demonstrates a positive correlation between performance on CARB and performance on downstream multilingual cultural alignment tasks. Further analysis identifies spurious correlations in culture-aware reward modeling, wherein RMs' scores rely predominantly on surface-level features rather than on a genuine understanding of cultural nuance. To address these issues, we propose Think-as-Locals, which elicits deeper, culturally grounded reasoning from generative RMs via reinforcement learning from verifiable rewards (RLVR). It employs well-designed rewards to ensure accurate preference judgments and the generation of high-quality structured evaluation criteria. Experimental results validate its efficacy in mitigating interference from spurious features and in advancing culture-aware reward modeling.
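Think-as-Locals trains a generative RM with RLVR, where the reward can be checked against a known preference label. The paper's exact reward design is not given here, so the following is only a minimal sketch under stated assumptions. The RM is assumed to emit its evaluation criteria inside `<rubric>...</rubric>` tags and a single final verdict written as `[[A]]` or `[[B]]`. A malformed output earns 0, a well-formed output with a wrong verdict earns a small format bonus, and a well-formed, correct verdict earns 1. The tag names, the verdict syntax, the `format_weight` value, and the function name `verifiable_reward` are all hypothetical. The format check only confirms that the rubric is non-empty, so it is a crude stand-in for judging whether the criteria are actually high quality.

```python
import re

# Hypothetical output conventions (assumptions, not taken from the paper).
RUBRIC_RE = re.compile(r"<rubric>(.*?)</rubric>", re.DOTALL)
VERDICT_RE = re.compile(r"\[\[([AB])\]\]")


def verifiable_reward(output: str, gold: str, format_weight: float = 0.2) -> float:
    """Toy RLVR reward for a pairwise generative RM.

    Returns 0.0 for malformed output, `format_weight` for a well-formed but
    wrong verdict, and 1.0 for a well-formed, correct verdict.
    """
    if gold not in ("A", "B"):
        raise ValueError("gold must be 'A' or 'B'")
    rubric = RUBRIC_RE.search(output)
    verdicts = VERDICT_RE.findall(output)
    # Well-formed means a non-empty rubric block and exactly one verdict.
    if rubric is None or not rubric.group(1).strip() or len(verdicts) != 1:
        return 0.0
    # The verdict is verifiable: compare it directly with the gold label.
    return 1.0 if verdicts[0] == gold else format_weight


sample = (
    "<rubric>1. Uses the locally expected form of address\n"
    "2. Avoids culturally inappropriate gift suggestions</rubric>\n"
    "Response A satisfies both criteria. Final verdict: [[A]]"
)
print(verifiable_reward(sample, "A"))  # 1.0
print(verifiable_reward(sample, "B"))  # 0.2
```

Zeroing malformed outputs and making correctness dominate keeps the policy from collecting reward by emitting well-formed but wrong verdicts. Any finer-grained scoring of rubric quality would need a separate checker.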
