Objective Decoupling in Social Reinforcement Learning: Recovering Ground Truth from Sycophantic Majorities

Majid Ghasemi, Mark Crowley

TL;DR

This paper proposes Epistemic Source Alignment (ESA), a robust method that uses sparse safety axioms to judge the source of the feedback rather than the signal itself, and proves that this "judging the judges" mechanism guarantees convergence to the true objective, even when a majority of evaluators are biased.

Abstract

Contemporary AI alignment strategies rely on a fragile premise: that human feedback, while noisy, remains a fundamentally truthful signal. In this paper, we identify this assumption as Dogma 4 of Reinforcement Learning (RL). We demonstrate that while this dogma holds in static environments, it fails in social settings where evaluators may be sycophantic, lazy, or adversarial. We prove that under Dogma 4, standard RL agents suffer from what we call Objective Decoupling, a structural failure mode where the agent's learned objective permanently separates from the latent ground truth, guaranteeing convergence to misalignment. To resolve this, we propose Epistemic Source Alignment (ESA). Unlike standard robust methods that rely on statistical consensus (trusting the majority), ESA utilizes sparse safety axioms to judge the source of the feedback rather than the signal itself. We prove that this "judging the judges" mechanism guarantees convergence to the true objective, even when a majority of evaluators are biased. Empirically, we show that while traditional consensus methods fail under majority collusion, our approach successfully recovers the optimal policy.

Paper Structure (45 sections, 3 theorems, 14 equations, 14 figures, 3 tables, 2 algorithms)

This paper contains 45 sections, 3 theorems, 14 equations, 14 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

Let $\mathcal{A}$ be a finite action space. If there exists a mismatch between the social optimum $a_{soc}$ and latent optimum $a^*$ with value gap $\Delta$, then any algorithm achieving sublinear regret on the observed signal $\overline{R}$ suffers linear regret on the latent signal $R^*$. Specific
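The regret split described in Proposition 1 can be illustrated with a toy two-armed bandit. The sketch below is not the paper's construction: it assumes noiseless rewards, a standard UCB1 learner, and hypothetical reward tables in which the social optimum $a_{soc}$ (arm 1) differs from the latent optimum $a^*$ (arm 0) with value gap $\Delta = 1$. The function name `run_ucb1` and all numeric values are illustrative assumptions.

```python
import math


def run_ucb1(observed_means, latent_means, horizon):
    """Run UCB1 on the observed signal and track regret on both signals.

    Rewards are noiseless here (an assumption made to keep the sketch
    deterministic); the learner only ever sees `observed_means`.
    """
    k = len(observed_means)
    counts = [0] * k
    sums = [0.0] * k
    best_observed = max(observed_means)
    best_latent = max(latent_means)
    observed_regret = 0.0
    latent_regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise the estimates
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        counts[arm] += 1
        sums[arm] += observed_means[arm]
        observed_regret += best_observed - observed_means[arm]
        latent_regret += best_latent - latent_means[arm]

    return observed_regret, latent_regret, counts


if __name__ == "__main__":
    # Hypothetical values: arm 1 is the social optimum, arm 0 the latent optimum.
    observed = [0.2, 0.8]
    latent = [1.0, 0.0]
    for horizon in (1000, 5000, 20000):
        obs_reg, lat_reg, _ = run_ucb1(observed, latent, horizon)
        print(horizon, round(obs_reg, 1), round(lat_reg, 1))
```

In this toy setting the learner does exactly what it is asked to do on the observed signal, so its observed regret grows only logarithmically, while almost every round costs the full gap $\Delta$ on the latent signal.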

Figures (14)

  • Figure 1: The Social MDP Framework: Objective Decoupling vs. Epistemic Recovery. (Top) In a Social MDP, the agent does not observe the latent ground truth $R^*$; instead, it receives feedback from a "Social Layer" of evaluators who may be truthful, sycophantic, or adversarial. (Bottom Left) Standard RL agents operating under Dogma 4 treat this aggregate social signal as ground truth. When systematic bias dominates (e.g., a sycophantic majority), naive aggregation leads to Objective Decoupling, where the agent optimizes for approval rather than value. (Bottom Right) The ESA Agent intervenes by using sparse internal axioms ($z_t$) to decide source reliability. By updating trust weights ($w_t$) based on consistency with these axioms, the agent suppresses biased evaluators and asymptotically recovers the latent optimal policy.
  • Figure 2: Testbed 1 (Gridworld): The Sycophant Trap. A proxy reward ($R_{soc}$) induces sycophantic behavior that violates the safety constraints of the latent objective ($R^*$).
  • Figure 3: Both Mean (Standard) and Median (Robust) aggregation fail under majority bias. Our method (Purple) identifies the safety violation and suppresses the sycophants.
  • Figure 4: Continuous Control (Hopper-v4). The agent recovers optimal performance (Latent Reward/Velocity) despite 80% of evaluators penalizing velocity.
  • Figure 5: The Failure of Consensus (80% Bias). Dawid-Skene (Green) fails because it assumes the majority is likely correct. Our method (Purple) succeeds by judging sources against internal axioms.
  • ...and 9 more figures
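
The contrast drawn in the captions above (mean and median aggregation failing under an 80% biased majority, versus trust weights driven by axiom consistency) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's algorithm: the multiplicative (exponential) weight update, the learning rate `eta`, the function name `axiom_trust_weights`, and the toy reports are all hypothetical choices made for the example.

```python
import math
from statistics import mean, median


def axiom_trust_weights(reports, axiom_checks, eta=1.0):
    """Return normalised trust weights, one per evaluator.

    reports[t][i]   : score evaluator i gives at step t
    axiom_checks[t] : score implied by an axiom at step t, or None when
                      no axiom applies (axioms are sparse)
    The exponential penalty on disagreement is an assumed update rule.
    """
    n_evaluators = len(reports[0])
    log_w = [0.0] * n_evaluators
    for scores, z in zip(reports, axiom_checks):
        if z is None:
            continue  # no axiom fires, so trust is left unchanged
        for i, s in enumerate(scores):
            log_w[i] -= eta * abs(s - z)
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


# Toy data: four sycophants score an unsafe action 1.0, one truthful
# evaluator scores it 0.0 (its assumed latent value).
steps = 20
reports = [[1.0, 1.0, 1.0, 1.0, 0.0] for _ in range(steps)]
# An axiom identifies the action as unsafe on every fifth step only.
axiom_checks = [0.0 if t % 5 == 0 else None for t in range(steps)]

weights = axiom_trust_weights(reports, axiom_checks)
last = reports[-1]
mean_estimate = mean(last)
median_estimate = median(last)
weighted_estimate = sum(w * s for w, s in zip(weights, last))

if __name__ == "__main__":
    print("mean:", mean_estimate)
    print("median:", median_estimate)
    print("axiom-weighted:", round(weighted_estimate, 3))
```

With only four axiom checks in twenty steps, the weight on the single truthful evaluator already dominates in this toy case, whereas both consensus estimates stay with the biased majority. The design point is that trust is updated only where an axiom gives a verdict, then reused on steps where none applies.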

Theorems & Definitions (8)

  • Definition 1: Objective Decoupling Gap
  • Proposition 1: Rate of Objective Decoupling
  • Definition 2: Informational Dominance
  • Proposition 2: Exponential Trust Concentration
  • Theorem 1: Robustness to Strategic Adaptation
  • proof
  • proof
  • proof