Objective Decoupling in Social Reinforcement Learning: Recovering Ground Truth from Sycophantic Majorities

Majid Ghasemi; Mark Crowley

TL;DR

This paper proposes Epistemic Source Alignment (ESA), a robust method that uses sparse safety axioms to judge the source of the feedback rather than the signal itself, and proves that this "judging the judges" mechanism guarantees convergence to the true objective, even when a majority of evaluators are biased.

Abstract

Contemporary AI alignment strategies rely on a fragile premise: that human feedback, while noisy, remains a fundamentally truthful signal. In this paper, we identify this assumption as Dogma 4 of Reinforcement Learning (RL). We demonstrate that while this dogma holds in static environments, it fails in social settings where evaluators may be sycophantic, lazy, or adversarial. We prove that under Dogma 4, standard RL agents suffer from what we call Objective Decoupling, a structural failure mode where the agent's learned objective permanently separates from the latent ground truth, guaranteeing convergence to misalignment. To resolve this, we propose Epistemic Source Alignment (ESA). Unlike standard robust methods that rely on statistical consensus (trusting the majority), ESA utilizes sparse safety axioms to judge the source of the feedback rather than the signal itself. We prove that this "judging the judges" mechanism guarantees convergence to the true objective, even when a majority of evaluators are biased. Empirically, we show that while traditional consensus methods fail under majority collusion, our approach successfully recovers the optimal policy.

Paper Structure

This paper contains 45 sections, 3 theorems, 14 equations, 14 figures, 3 tables, and 2 algorithms.

Introduction
Related Work
Theoretical Foundations: The Dogmas of RL
Alignment and Social Feedback Dynamics
Robustness, Trust, and Truth Discovery
Connection to Truth Discovery.
Distinctions from Alternative RL Paradigms
Problem Formulation
Theoretical Analysis: Decoupling and Recovery
Objective Decoupling
Convergence of ESA Agents
Methodology: The ESA Agent
Experimental Design
Testbeds and Environment Design.
Baselines.
...and 30 more sections

Key Result

Proposition 1

Let $\mathcal{A}$ be a finite action space. If there exists a mismatch between the social optimum $a_{soc}$ and latent optimum $a^*$ with value gap $\Delta$, then any algorithm achieving sublinear regret on the observed signal $\overline{R}$ suffers linear regret on the latent signal $R^*$. Specific
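The statement above can be illustrated with a toy two-armed bandit. The sketch below is not the paper's experiment: the reward values, the choice of UCB1 as the learner, and the function name `run_ucb` are all assumptions made for illustration. The learner sees only the observed signal $\overline{R}$, whose best arm is $a_{soc}$, while regret is also tracked against a latent signal $R^*$ whose best arm is $a^*$, with gap $\Delta = 0.5$.

```python
import math

def run_ucb(observed, latent, horizon):
    """Run UCB1 on the observed rewards only.

    Returns (observed_regret, latent_regret) accumulated over `horizon` steps.
    Rewards are deterministic here to keep the illustration reproducible.
    """
    k = len(observed)
    counts = [0] * k
    sums = [0.0] * k
    best_obs, best_lat = max(observed), max(latent)
    obs_regret = lat_regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1  # pull every arm once to initialise the estimates
        else:
            # UCB1 index: empirical mean plus exploration bonus
            a = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        counts[a] += 1
        sums[a] += observed[a]  # the learner never sees `latent`
        obs_regret += best_obs - observed[a]
        lat_regret += best_lat - latent[a]
    return obs_regret, lat_regret

# Hypothetical values: arm 1 is the social optimum, arm 0 the latent optimum.
OBSERVED = [0.2, 0.9]
LATENT = [1.0, 0.5]  # value gap Delta = 0.5

for T in (1000, 4000):
    obs, lat = run_ucb(OBSERVED, LATENT, T)
    print(f"T={T}: observed regret/T={obs / T:.3f}, latent regret/T={lat / T:.3f}")
```

In this toy setting the per-step observed regret shrinks as the horizon grows, while the per-step latent regret stays close to $\Delta$, which is the qualitative behavior the proposition describes.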

Figures (14)

Figure 1: The Social MDP Framework: Objective Decoupling vs. Epistemic Recovery. (Top) In a Social MDP, the agent does not observe the latent ground truth $R^*$; instead, it receives feedback from a "Social Layer" of evaluators who may be truthful, sycophantic, or adversarial. (Bottom Left) Standard RL agents operating under Dogma 4 treat this aggregate social signal as ground truth. When systematic bias dominates (e.g., a sycophantic majority), naive aggregation leads to Objective Decoupling, where the agent optimizes for approval rather than value. (Bottom Right) The ESA Agent intervenes by using sparse internal axioms ($z_t$) to decide source reliability. By updating trust weights ($w_t$) based on consistency with these axioms, the agent suppresses biased evaluators and asymptotically recovers the latent optimal policy.
Figure 2: Testbed 1 (Gridworld): The Sycophant Trap. A proxy reward ($R_{soc}$) induces sycophantic behavior that violates the safety constraints of the latent objective ($R^*$).
Figure 3: Both Mean (Standard) and Median (Robust) aggregation fail under majority bias. Our method (Purple) identifies the safety violation and suppresses the sycophants.
Figure 4: Continuous Control (Hopper-v4). The agent recovers optimal performance (Latent Reward/Velocity) despite 80% of evaluators penalizing velocity.
Figure 5: The Failure of Consensus (80% Bias). Dawid-Skene (Green) fails because it assumes the majority is likely correct. Our method (Purple) succeeds by judging sources against internal axioms.
...and 9 more figures
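
The captions for Figures 1, 3, and 5 contrast mean and median aggregation with trust weights updated from axiom checks. A minimal sketch of that contrast follows. It is not the authors' algorithm: the multiplicative-weights update, the step size `eta`, the evaluator models, and the single axiom ("the unsafe action must not be rated above 0.5") are all assumptions chosen for illustration, as are every function and variable name.

```python
import math
from statistics import mean, median

LATENT = {"safe": 1.0, "unsafe": 0.0}  # hypothetical latent reward R*

def evaluator_feedback(kind, action):
    """Truthful evaluators report R*; sycophants (assumed) invert it."""
    value = LATENT[action]
    return value if kind == "truthful" else 1.0 - value

def axiom_violated(action, feedback):
    """Hypothetical safety axiom: the unsafe action is never rated above 0.5."""
    return action == "unsafe" and feedback > 0.5

def update_trust(weights, kinds, action, eta=1.0):
    """Down-weight sources whose feedback contradicts the axiom, then renormalise."""
    scaled = [
        w * math.exp(-eta) if axiom_violated(action, evaluator_feedback(k, action)) else w
        for w, k in zip(weights, kinds)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

def aggregate(weights, kinds, action):
    """Trust-weighted average of evaluator feedback for one action."""
    return sum(w * evaluator_feedback(k, action) for w, k in zip(weights, kinds))

# 80% sycophantic majority, matching the bias level quoted in the captions.
kinds = ["truthful"] * 2 + ["sycophant"] * 8
weights = [1.0 / len(kinds)] * len(kinds)

# Baselines: consensus aggregation over raw feedback.
mean_scores = {a: mean(evaluator_feedback(k, a) for k in kinds) for a in LATENT}
median_scores = {a: median(evaluator_feedback(k, a) for k in kinds) for a in LATENT}

# Sparse axiom checks: only steps where the unsafe action was rated are informative.
for _ in range(10):
    weights = update_trust(weights, kinds, "unsafe")

trust_scores = {a: aggregate(weights, kinds, a) for a in LATENT}
sycophant_mass = sum(w for w, k in zip(weights, kinds) if k == "sycophant")

print("mean:", mean_scores)
print("median:", median_scores)
print("trust-weighted:", trust_scores)
```

In this toy setting both consensus baselines rank the unsafe action higher, while the trust-weighted aggregate ranks the safe action higher once the axiom checks have moved almost all weight onto the truthful minority. The key design choice is that weights depend on agreement with the axiom, not on agreement with other evaluators.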

Theorems & Definitions (8)

Definition 1: Objective Decoupling Gap
Proposition 1: Rate of Objective Decoupling
Definition 2: Informational Dominance
Proposition 2: Exponential Trust Concentration
Theorem 1: Robustness to Strategic Adaptation
proof
proof
proof
