Learning When to Trust in Contextual Bandits

Majid Ghasemi, Mark Crowley

Abstract

Standard approaches to Robust Reinforcement Learning assume that feedback sources are either globally trustworthy or globally adversarial. We challenge this assumption and identify a subtler failure mode, which we term Contextual Sycophancy: evaluators are truthful in benign contexts but strategically biased in critical ones. We prove that standard robust methods fail in this setting because they suffer from Contextual Objective Decoupling. To address this, we propose CESA-LinUCB, which learns a high-dimensional Trust Boundary for each evaluator. We prove that CESA-LinUCB achieves sublinear regret $\tilde{O}(\sqrt{T})$ against contextual adversaries, recovering the ground truth even when no evaluator is globally reliable.

Paper Structure (17 sections, 4 theorems, 19 equations, 5 figures, 1 algorithm)

Key Result

Theorem 1

Let $\pi_{soc}(x)$ be the optimal policy under social feedback, and $\pi^*(x)$ be the ground-truth optimal policy. If there exists a sub-region $\mathcal{X}_{dec} \subset \mathcal{X}$ with measure $\mu(\mathcal{X}_{dec}) > 0$ such that for all $x \in \mathcal{X}_{dec}$: Then, any algorithm $\mathfrak{A}$ with sublinear regret on $\bar{y}$ suffers linear regret on $R^*$. Specifically, $\mathcal{R}
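
The decoupling effect can be illustrated with a toy two-armed simulation. This is a minimal sketch, not the paper's construction: the condition on $\mathcal{X}_{dec}$ is not reproduced above, so the code assumes it means that the socially preferred arm differs from the ground-truth optimal arm inside the region. The function `simulate` and the parameters `p_dec` (standing in for $\mu(\mathcal{X}_{dec})$) and `gap` are hypothetical names introduced for illustration.

```python
import random

def simulate(T, p_dec=0.3, gap=0.5, seed=0):
    """Play the policy that is optimal for the (assumed) mean social feedback.

    Returns (regret on social feedback, regret on ground truth) after T rounds.
    """
    rng = random.Random(seed)
    regret_social = 0.0
    regret_true = 0.0
    for _ in range(T):
        # Assumption: each context falls in the decoupling region with prob. p_dec.
        in_dec = rng.random() < p_dec
        # Ground truth: arm 0 is better than arm 1 by `gap` in every context.
        true_reward = [1.0, 1.0 - gap]
        # Social feedback agrees with the truth in benign contexts and
        # ranks the arms the other way round inside the decoupling region.
        social = true_reward[::-1] if in_dec else true_reward
        arm = max(range(2), key=lambda a: social[a])
        regret_social += max(social) - social[arm]
        regret_true += max(true_reward) - true_reward[arm]
    return regret_social, regret_true

if __name__ == "__main__":
    for T in (1000, 2000, 4000):
        print(T, simulate(T))
```

Under these assumptions the policy has zero regret on the social signal, while its ground-truth regret grows at roughly `p_dec * gap` per round, that is, linearly in $T$.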

Figures (5)

  • Figure 1: Contextual Objective Decoupling vs. Epistemic Source Alignment. (A) The Failure Mode: In a "Hostile Majority" environment (50% sycophants, 30% contextual liars), standard robust aggregators suffer from Contextual Objective Decoupling. The social consensus structurally diverges from the ground truth in critical "decoupling regions," trapping the agent in a sycophantic policy. (B) The Solution (CESA-LinUCB): The proposed algorithm resolves this by learning a high-dimensional Trust Boundary for each evaluator. The process involves: (1) Epistemic Forecasting to predict context-specific trustworthiness, (2) Trust-Weighted Action Selection to filter adversarial feedback, (3) Sparse Axiomatic Checks ($z_t$) to anchor the trust model to reality, and (4) Weighted Ridge Regression (WRR) to update the policy using only the verified honest minority.
  • Figure 2: Robustness: CESA-LinUCB (Green) recovers the ground truth despite an 80% hostile majority, while the baseline (Gray) suffers linear regret. Performance on the "Hostile Majority" Benchmark ($D=20, M=10$).
  • Figure 3: Dynamics: The agent learns to trust Honest evaluators (Green) while suppressing Sycophants (Red) and Liars (Orange).
  • Figure 4: Sensitivity Analysis. (a) confirms the $\tilde{O}(d)$ complexity bound. (b) reveals a "Phase Transition" where minimal ground-truth checking yields massive safety gains.
  • Figure 5: Algorithm Agnostic Robustness. Comparison of CESA-LinUCB (Green) and ESA-Thompson Sampling (Blue) in the "Hostile Majority" environment. Both methods avoid the catastrophic failure mode of standard robust baselines, confirming that the Epistemic Trust mechanism functions effectively as a universal data sanitation layer for both frequentist and Bayesian exploration strategies.
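
The trust-learning steps named in the Figure 1 caption (epistemic forecasting, sparse axiomatic checks, and filtering of feedback by predicted trust) can be sketched as follows. This is a simplified illustration under stated assumptions, not the CESA-LinUCB algorithm itself: it models each evaluator's trust as a logistic function of the context trained by SGD, uses a two-feature context (bias plus a "critical" indicator), and omits the LinUCB exploration bonus and the weighted ridge regression update. All names (`TrustModel`, `evaluator_report`, `run`, `trusted_mean`) and the values `lr`, `p_check`, and `threshold` are hypothetical. The evaluator mix (2 honest, 5 sycophants, 3 contextual liars) mirrors the 50% / 30% split and $M=10$ mentioned in the captions.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

class TrustModel:
    """Assumed per-evaluator trust function: tau(x) = sigmoid(w . x)."""

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, agreed):
        # One SGD step on the logistic loss; `agreed` is 1.0 when the
        # evaluator's report matched the ground-truth check, else 0.0.
        err = agreed - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]

def evaluator_report(kind, truth, critical):
    if kind == "honest":
        return truth
    if kind == "sycophant":
        return 1.0  # always approves, whatever the truth is
    # Contextual liar: truthful in benign contexts, inverted in critical ones.
    return 1.0 - truth if critical else truth

def run(T=5000, p_check=0.1, seed=0):
    rng = random.Random(seed)
    kinds = ["honest"] * 2 + ["sycophant"] * 5 + ["liar"] * 3
    models = [TrustModel(dim=2) for _ in kinds]
    for _ in range(T):
        critical = rng.random() < 0.5
        x = [1.0, 1.0 if critical else 0.0]  # bias + critical-context flag
        truth = float(rng.random() < 0.5)    # binary ground-truth label
        reports = [evaluator_report(k, truth, critical) for k in kinds]
        if rng.random() < p_check:           # sparse check against the truth
            for model, report in zip(models, reports):
                model.update(x, 1.0 if report == truth else 0.0)
    return kinds, models

def trusted_mean(models, reports, x, threshold=0.8):
    # Keep only reports whose predicted trust in this context exceeds
    # the (hypothetical) threshold; return None if nobody qualifies.
    kept = [r for m, r in zip(models, reports) if m.predict(x) > threshold]
    return sum(kept) / len(kept) if kept else None

if __name__ == "__main__":
    kinds, models = run()
    for name, x in (("benign", [1.0, 0.0]), ("critical", [1.0, 1.0])):
        print(name, [(k, round(m.predict(x), 2)) for k, m in zip(kinds, models)])
```

In this toy setting the contextual liars earn high trust in benign contexts and low trust in critical ones, which is the kind of context-dependent boundary the caption describes. A single global trust score per evaluator could not represent that split.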

Theorems & Definitions (10)

  • Definition 1: Context-Dependent Bias
  • Definition 2: The Trust Function
  • Theorem 1: Contextual Objective Decoupling (COD)
  • proof
  • Lemma 2: The Price of Distrust
  • proof
  • Theorem 3: Regret Upper Bound
  • proof
  • Proposition 4: Sycophantic Complexity
  • proof