
MARS: Margin-Aware Reward-Modeling with Self-Refinement

Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon

TL;DR

This paper proposes MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets the reward model's ambiguous cases and failure modes. MARS concentrates augmentation on low-margin preference pairs, where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation.

Abstract

Reward modeling is a core component of modern alignment pipelines such as RLHF and RLAIF, underpinning policy optimization methods including PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of data augmentation. Existing augmentation approaches typically operate at the representation or semantic level and remain agnostic to the reward model's estimation difficulty. In this paper, we propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets the reward model's ambiguous cases and failure modes. MARS concentrates augmentation on low-margin (ambiguous) preference pairs, where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, thereby enhancing information and improving conditioning, along with empirical results demonstrating consistent gains over uniform augmentation for robust reward modeling.

Paper Structure

This paper contains 15 sections, 1 theorem, 28 equations, 10 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 1

Let $R=\alpha P+(1-\alpha)Q$ with $\alpha\in[0,1]$. Under Assumptions 1 and 2, the average curvature (eq:avgHessian) satisfies the positive semidefinite (PSD) domination condition, where $\gamma_{\mathrm{curv}}:=\beta\,c(\gamma_{\mathrm{aug}})/c(\gamma_{\mathrm{org}})$. This implies that if $\gamma_{\mathrm{curv}}>1$, the mixture distribution $R$ induces uniformly larger curvature than $P$ in all pa

Figures (10)

  • Figure 1: Comparison between MARS (this paper) and existing methods, namely no augmentation, Uniform Augmentation, and West-of-N (WoN) (Pace et al., 2024), on the PKU-SafeRLHF dataset (Ji et al., 2024) with the DeBERTa-v3-base model, across three evaluation metrics: (1) SNR: the ratio of the mean to the standard deviation of the obtained margin, (2) Pairwise Accuracy: the fraction of pairs in which the reward for the chosen response is higher than that for the rejected response, and (3) Win-Rate of the TinyLlama-1.1B-Chat-v1.0 and Llama-3.2-1B-Instruct models aligned using the trained reward models.
  • Figure 2: Adaptive data augmentation-based reward modeling in MARS. At every epoch $t$, the reward model (RM) from the previous stage, $r_{\theta}^{t-1}$, computes the margin of every sample in the preference dataset $\mathcal{D}$, and samples with lower $|\Delta_i^t|$ receive a larger share of the augmentation budget. The updated dataset (the preference dataset plus the synthetic dataset) is then used to train $r_{\theta}^{t-1}$ to obtain $r_{\theta}^{t}$.
  • Figure 3: Proposed adaptive augmentation and refinement workflow. At every epoch $t$, the reward model (RM) from the previous stage, $r_{\theta}^{t-1}$, computes the margins, which determine the selection/augmentation probability $q_i^t$. Given a fixed budget $B^t$, augmented samples are generated for the $i$-th sample such that $n_i^{+}+n_i^{-}=B^t\cdot q_i^t$. The augmented samples are concatenated with the previous dataset to form the augmented dataset $\mathcal{D}^t=\mathcal{D}^{t-1}\cup \mathcal{D}_{\text{syn}}$, which is then used to train $r_{\theta}^{t-1}$ to obtain $r_{\theta}^{t}$.
  • Figure 4: Pairwise accuracy of DeBERTa-v3-base reward models under different training strategies. Results are reported on the Anthropic HH-RLHF (Bai et al., 2022), UltraFeedback (Cui et al., 2023), and PKU-SafeRLHF (Ji et al., 2024) test datasets. We compare training without augmentation, Uniform Augmentation, West-of-N (WoN) (Pace et al., 2024), and MARS (this paper).
  • Figure 5: Small-margin (hard) preference pairs exhibit higher curvature. Left: Minimum eigenvalue of the bin-averaged empirical Fisher matrix, $\lambda_{\min}\!\left(\frac{1}{|B|}\sum_{z\in B} \widehat{I}(z)\right)$, across equal-count bins sorted by $| \Delta_{\theta}(z) |$. Right: Mean curvature weight $\mathbb{E}_{z\sim B}[\sigma(\Delta_\theta(z))(1-\sigma(\Delta_\theta(z)))]$. Hard samples (small $\lvert \Delta \rvert$) induce substantially higher curvature than confident pairs.
  • ...and 5 more figures
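
The budget-allocation step described in the Figure 3 caption can be illustrated with a minimal sketch. The captions only state that samples with smaller $|\Delta_i^t|$ receive more budget and that $n_i^{+}+n_i^{-}=B^t\cdot q_i^t$; the exact form of $q_i^t$ is not given here. The softmax over $-|\Delta_i|/\tau$, the `temperature` parameter, the largest-remainder rounding, and the function names below are all assumptions made for illustration, not the authors' implementation.

```python
import math

def allocation_probs(margins, temperature=1.0):
    """Assumed form: q_i proportional to exp(-|margin_i| / temperature)."""
    scores = [-abs(m) / temperature for m in margins]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def allocate_budget(probs, budget):
    """Round budget * q_i to integers that sum exactly to budget."""
    raw = [budget * q for q in probs]
    counts = [math.floor(r) for r in raw]
    leftover = budget - sum(counts)
    # Give the remaining units to the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# Hypothetical margins r(chosen) - r(rejected) for four preference pairs.
margins = [0.05, -0.2, 1.5, 3.0]
probs = allocation_probs(margins)
counts = allocate_budget(probs, budget=20)
print(counts)  # per-pair number of augmented samples (n_i^+ + n_i^-)
```

Under this assumed rule, pairs whose margin is closest to zero receive the largest share of the budget. How each per-pair count is split between $n_i^{+}$ and $n_i^{-}$ is not specified in the captions and is left out of the sketch.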

Theorems & Definitions (1)

  • Theorem 1: Margin-Induced Average Curvature
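
The curvature weight $\sigma(\Delta)(1-\sigma(\Delta))$ that appears in the Figure 5 caption can be checked numerically with a short sketch. It peaks at $\Delta=0$ with value $1/4$ and decays as $|\Delta|$ grows, which is why small-margin pairs contribute more curvature. The function names and the example margins below are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def curvature_weight(margin):
    """sigma(margin) * (1 - sigma(margin)), the per-sample curvature weight."""
    s = sigmoid(margin)
    return s * (1.0 - s)

def mean_curvature_weight(margins):
    """Average curvature weight over a bin of margins."""
    return sum(curvature_weight(m) for m in margins) / len(margins)

# Hypothetical bins: small-|margin| (hard) versus large-|margin| (confident) pairs.
hard = [0.0, 0.1, -0.2]
easy = [3.0, -4.0, 5.0]
print(mean_curvature_weight(hard), mean_curvature_weight(easy))
```

The weight depends only on $|\Delta|$, so a confidently wrong pair and a confidently right pair with the same margin magnitude contribute the same small amount.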