BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Praveen Kumar Myakala, Manan Agrawal, Rahul Manche

Abstract

LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That's the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).

Paper Structure

This paper contains 34 sections, 5 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: The sycophancy feedback loop leading to the mirroring effect. Over successive sessions, the model progressively reinforces user beliefs $U_t$, amplifying them into $U_{t+1}$ without independent grounding (methuku2025doppelgangers; borah2025mind). BeliefShift's Drift Coherence Score (DCS) quantifies this loop across all sessions in a trajectory.
  • Figure 2: Rational belief revision (blue) versus sycophantic drift (red). In revision, a new belief $B_r$ is caused by external evidence $E$. In drift, the belief shift $B_d$ is induced by model bias $M$, with no grounding in new information. BeliefShift's Evidence Sensitivity Index (ESI) distinguishes these two paths.
  • Figure 3: BSV trajectories for three representative model behaviors across a 10-session trajectory containing two evidence events ($E_1$ at session 4, $E_2$ at session 7, marked by dashed orange lines). Model A tracks ground truth belief shifts accurately (high BRA, high ESI). Model B drifts monotonically upward despite the downward revision at $E_2$ (low DCS, negative ESI). Model C resists all belief change including legitimate evidence-driven revisions (low BRA, high DCS). The metric scores below the plot confirm the stability-adaptability trade-off identified in Section \ref{sec:intro}.
  • Figure 4: ESI scores for all seven models under RAG as the sensitivity threshold $\theta$ varies from 0.05 to 0.35. The default threshold $\theta^* = 0.15$ is marked with a dashed orange line. Model rankings remain stable across the full range, confirming robustness of the ESI metric to threshold choice.
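
The robustness check described for Figure 4 (model rankings staying the same as $\theta$ varies) can be illustrated with a minimal sketch. This is not the authors' implementation: the ESI formula is not given in this excerpt, so `toy_score` is a made-up stand-in, the model names are placeholders, and the 0.05 step between thresholds is an assumption (the caption only gives the range 0.05 to 0.35).

```python
from typing import Callable, Dict, List, Sequence


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Order model names from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda name: (-scores[name], name))


def rankings_stable(
    score_fn: Callable[[str, float], float],
    models: Sequence[str],
    thetas: Sequence[float],
) -> bool:
    """Return True if the model ranking is identical at every threshold."""
    rankings = [
        rank_models({m: score_fn(m, theta) for m in models}) for theta in thetas
    ]
    return all(r == rankings[0] for r in rankings)


# Hypothetical stand-in for an ESI-like score; NOT the paper's definition.
# Each placeholder model gets an arbitrary base value shifted by the threshold.
TOY_BASE = {"model_a": 0.8, "model_b": 0.5, "model_c": 0.2}


def toy_score(model: str, theta: float) -> float:
    return TOY_BASE[model] - theta


# Assumed sweep grid: 0.05 to 0.35 in steps of 0.05 (step size is a guess).
thetas = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
stable = rankings_stable(toy_score, list(TOY_BASE), thetas)
print("rankings stable across thresholds:", stable)
```

Comparing full rankings at each threshold is the strictest form of this check; a rank correlation between thresholds would be a softer alternative when near-ties are expected.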