Learning Can Converge Stably to the Wrong Belief under Latent Reliability

Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang

Abstract

Learning systems are typically optimized by minimizing loss or maximizing reward, assuming that improvements in these signals reflect progress toward the true objective. However, when feedback reliability is unobservable, this assumption can fail, and learning algorithms may converge stably to incorrect solutions. This failure arises because single-step feedback does not reveal whether an experience is informative or persistently biased. When information is aggregated over learning trajectories, however, systematic differences between reliable and unreliable regimes can emerge. We propose a Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics and modulates updates through a slow-timescale trust variable. Across reinforcement learning and supervised learning settings, standard algorithms exhibit stable optimization behavior while learning incorrect solutions under latent unreliability, whereas trust-modulated systems reduce bias accumulation and improve recovery. These results suggest that learning dynamics are not only optimization traces but also a source of information about feedback reliability.
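As a minimal illustration of the failure mode described above, the sketch below runs plain stochastic gradient descent on a scalar squared loss whose targets carry a constant, unobserved offset. It is not the paper's construction: the function name `run_sgd`, the Gaussian noise model, and all parameter values are assumptions chosen for illustration.

```python
import random


def run_sgd(theta_star=1.0, bias=0.5, lr=0.05, steps=2000, noise=0.1, seed=0):
    """SGD on 0.5 * (theta - y)^2 where y = theta_star + bias + noise.

    The learner only ever sees y, so the persistent bias is invisible
    in any single step of feedback (all values here are illustrative).
    """
    rng = random.Random(seed)
    theta = 0.0
    losses = []
    for _ in range(steps):
        y = theta_star + bias + rng.gauss(0.0, noise)  # biased observation
        grad = theta - y                               # gradient of the observed loss
        losses.append(0.5 * grad * grad)
        theta -= lr * grad
    return theta, losses


theta, losses = run_sgd()
early_loss = sum(losses[:100]) / 100
late_loss = sum(losses[-100:]) / 100
print(f"theta = {theta:.3f}, early loss = {early_loss:.4f}, late loss = {late_loss:.4f}")
```

In this toy setting the observed loss falls and then stays flat, which looks like ordinary convergence, while the estimate settles near `theta_star + bias` rather than `theta_star`.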

Paper Structure (27 sections, 12 equations, 6 figures)

Figures (6)

  • Figure 1: Scale-dependent detectability under latent reliability. (A) Under latent drift, estimation error initially decreases but eventually increases with more data, despite apparently stable learning dynamics. (B) Local (finite-window) statistics fail to distinguish between reliable and biased regimes (AUC $\approx 0.5$), whereas trajectory-level features enable reliable detection (AUC $\approx 0.85$). (C) Trajectory-level detectability increases continuously with drift strength, indicating that identifiability emerges progressively as evidence accumulates over time.
  • Figure 2: Stable but incorrect convergence under persistent bias. Despite monotonic reduction of the optimization signal and apparent convergence, the parameter converges to a biased solution $\theta^\dagger \neq \theta^*$. This minimal construction shows that satisfying standard convergence criteria does not guarantee correctness of the learned solution under persistent bias.
  • Figure 3: Stable but incorrect learning behavior under latent reliability. During the corruption phase, the standard learner (PPO) exhibits stable training dynamics and continued updates, yet its performance deteriorates substantially. After reliable feedback is restored, it fails to recover to near-clean performance and remains trapped in a degraded regime. In contrast, trust-modulated learning maintains recoverability and returns to near-clean performance after corruption. Thus, stability of the training process does not by itself imply that the learned solution is correct under latent reliability.
  • Figure 4: Trajectory-level identifiability under latent reliability. (Left) Trajectory instability $S_t$ exhibits systematic regime-dependent behavior: under persistent corruption, instability increases and remains elevated relative to clean conditions. (Right) The distribution of $S_t$ becomes statistically separable between clean and corrupted regimes, even though instantaneous signals remain locally indistinguishable. These plots are consistent with the view that reliability leaves a signature in aggregated dynamics rather than in isolated gradient steps.
  • Figure 5: The Monitor-Trust-Regulator (MTR) framework for metacognitive regulation. A secondary regulatory loop (top) operates alongside the primary learning loop (bottom). The Monitor $\mathcal{M}$ extracts trajectory-level stability signals from learning dynamics, which are aggregated by the Trust Estimator $\mathcal{T}$ over a slower timescale to form a trust estimate $\tau_t$. The Regulator $\mathcal{R}$ uses this estimate to modulate the effective learning gain of the base learner $\mathcal{B}$. This architecture implements trajectory-level reliability inference as a structural component of learning under latent reliability, without requiring explicit reliability labels.
  • ...and 1 more figures
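
The Monitor, Trust Estimator, and Regulator loop described for Figure 5 can be sketched on a scalar toy problem. Everything below is an assumption made for illustration, not the paper's implementation: the class and function names, the monitor statistic (directional consistency of recent gradients, a stand-in for the paper's $S_t$, whose definition is not given here), the exponential-moving-average trust update, and all parameter values. The timescales are chosen so that trust reacts before the base learner has absorbed much of the bias, and the learner starts from a solution already fit to clean feedback.

```python
import random
from collections import deque


class Monitor:
    """Hypothetical monitor: directional consistency of recent gradients, in [0, 1]."""

    def __init__(self, window=20):
        self.grads = deque(maxlen=window)

    def signal(self, grad):
        self.grads.append(grad)
        total_abs = sum(abs(g) for g in self.grads)
        if total_abs == 0.0:
            return 0.0
        # Near 1 when recent gradients share a sign, small for zero-mean noise.
        return abs(sum(self.grads)) / total_abs


class TrustEstimator:
    """Moving average of (1 - signal); a stand-in for the trust estimate tau_t."""

    def __init__(self, rate=0.05):
        self.rate = rate
        self.tau = 1.0

    def update(self, signal):
        self.tau += self.rate * ((1.0 - signal) - self.tau)
        return self.tau


def regulate(base_lr, tau):
    """Regulator: scale the base learner's gain by the trust estimate."""
    return base_lr * tau


def run(use_trust, steps=3000, corrupt=(1000, 2000), theta_star=1.0,
        bias=0.5, base_lr=0.005, noise=0.1, seed=0):
    rng = random.Random(seed)
    monitor, trust = Monitor(), TrustEstimator()
    theta = theta_star  # assume the learner was already fit on clean feedback
    max_error = 0.0
    for t in range(steps):
        shift = bias if corrupt[0] <= t < corrupt[1] else 0.0  # latent corruption phase
        y = theta_star + shift + rng.gauss(0.0, noise)
        grad = theta - y
        tau = trust.update(monitor.signal(grad))
        lr = regulate(base_lr, tau) if use_trust else base_lr
        theta -= lr * grad
        max_error = max(max_error, abs(theta - theta_star))
    return max_error


plain_error = run(use_trust=False)
trusted_error = run(use_trust=True)
print(f"max |theta - theta*|: plain = {plain_error:.3f}, trust-modulated = {trusted_error:.3f}")
```

Under these assumed settings, the trust-modulated run should drift much less toward the biased target during the corruption phase than the unmodulated run. The toy illustrates reduced bias accumulation only. It does not reproduce the recovery behavior reported for PPO in Figure 3, and this particular monitor would also lower trust during legitimate initial learning, when gradients are directional for benign reasons.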