A Mathematical Programming Approach to Computing and Learning Berk--Nash Equilibria in Infinite-Horizon MDPs

Quanyan Zhu, Zhengye Han

Abstract

We study sequential decision-making when the agent's internal model class is misspecified. Within the infinite-horizon Berk-Nash framework, stable behavior arises as a fixed point: the agent acts optimally relative to a subjective model, while that model is statistically consistent with the long-run data endogenously generated by the policy itself. We provide a rigorous characterization of this equilibrium via coupled linear programs and a bilevel optimization formulation. To address the intrinsic non-smoothness of standard best-response correspondences, we introduce entropy regularization, establishing the existence of a unique soft Bellman fixed point and a smooth objective. Exploiting this regularity, we develop an online learning scheme that casts model selection as an adversarial bandit problem using an EXP3-type update, augmented by a novel conjecture-set zooming mechanism that adaptively refines the parameter space. Numerical results demonstrate effective exploration-exploitation trade-offs, convergence to the KL-minimizing model, and sublinear regret.
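To make the entropy-regularization step concrete, here is a minimal sketch of a soft Bellman iteration on a small tabular MDP. It is a generic illustration and not the paper's algorithm: the discounted setting, the temperature name `tau`, the function name `soft_value_iteration`, and the toy transition kernel are all assumptions introduced here, and how `tau` relates to the paper's $\lambda$ is not specified in this summary. The soft Bellman operator with a log-sum-exp backup is a $\gamma$-contraction in the sup norm, which is why the iteration below settles on a single fixed point and yields a smooth (softmax) policy.

```python
import numpy as np

def soft_value_iteration(P, r, gamma=0.9, tau=1.0, tol=1e-10, max_iter=10_000):
    """Iterate the entropy-regularized (soft) Bellman operator to its fixed point.

    P   : array (S, A, S), hypothetical subjective transition kernel
    r   : array (S, A), rewards
    tau : temperature > 0 (assumed name; larger means more smoothing)
    Returns the soft value v (S,) and the softmax policy pi (S, A).
    """
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        q = r + gamma * (P @ v)                      # soft Q-values, shape (S, A)
        m = q.max(axis=1)                            # shift for a stable log-sum-exp
        v_new = m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    q = r + gamma * (P @ v)
    pi = np.exp((q - q.max(axis=1, keepdims=True)) / tau)
    pi /= pi.sum(axis=1, keepdims=True)              # softmax policy at the fixed point
    return v, pi

if __name__ == "__main__":
    # Toy 2-state, 2-action MDP (made up for illustration).
    P = np.full((2, 2, 2), 0.5)
    r = np.array([[1.0, 0.0], [0.0, 1.0]])
    for tau in (100.0, 1.0, 0.01):
        v, pi = soft_value_iteration(P, r, tau=tau)
        print(tau, pi[0])
```

In this toy example the policy is close to uniform at a large temperature and close to deterministic at a small one, which is the kind of smooth interpolation that regularization is meant to provide.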

Paper Structure (42 sections, 9 theorems, 20 equations, 6 figures, 2 algorithms)

Key Result

lemma 1

Under Assumptions ass:primitives--ass:KL, for each $\pi\in\Sigma$:

Figures (6)

  • Figure 1: Empirical selection frequencies. The algorithm concentrates 82% of decisions on $\theta^1$.
  • Figure 2: Instantaneous loss $\widehat{J}_t$ and running average. Convergence to BN loss (0.012) observed.
  • Figure 3: Policy $\pi_{\theta^1,\lambda}(\cdot|0)$ vs. $\log_{10}\lambda$. Transition from uniform to deterministic.
  • Figure 4: Value $v_{\theta^1,\lambda}$ vs. $\log_{10}\lambda$. Values converge smoothly to the unregularized limit.
  • Figure 5: Selected $\epsilon_t$. The algorithm rapidly concentrates on the true model region ($\epsilon \approx 0$).
  • ...and 1 more figure
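
For orientation on how selection frequencies such as those in Figure 1 can arise, the following is a minimal sketch of a textbook EXP3 loop over a finite set of candidate models. It is not the paper's algorithm: the conjecture-set zooming mechanism is omitted, and the function name `exp3_selection_frequencies`, the fixed per-model losses, and the default learning rate are assumptions made for illustration. Losses are assumed to lie in $[0,1]$.

```python
import numpy as np

def exp3_selection_frequencies(loss_fn, K, T, eta=None, seed=0):
    """Run a basic EXP3 loop and return the empirical selection frequencies.

    loss_fn : callable k -> loss in [0, 1] for candidate model k (hypothetical)
    K       : number of candidate models
    T       : number of rounds
    eta     : learning rate; defaults to sqrt(2 ln K / (K T)), a standard choice
    """
    if eta is None:
        eta = np.sqrt(2.0 * np.log(K) / (K * T))
    rng = np.random.default_rng(seed)
    log_w = np.zeros(K)                      # log-weights, kept in log space for stability
    counts = np.zeros(K)
    for _ in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                         # sampling distribution over models
        k = rng.choice(K, p=p)
        loss = loss_fn(k)                    # only the chosen model's loss is observed
        log_w[k] -= eta * loss / p[k]        # importance-weighted loss estimate
        counts[k] += 1
    return counts / T

if __name__ == "__main__":
    # Made-up per-model losses; model 0 plays the role of the best-fitting model.
    means = np.array([0.1, 0.5, 0.9])
    print(exp3_selection_frequencies(lambda k: means[k], K=3, T=20_000))
```

The importance weighting `loss / p[k]` keeps the loss estimate unbiased even though only the selected model is evaluated each round, which is what lets the sampling distribution concentrate on the lowest-loss model over time.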

Theorems & Definitions (21)

  • definition 1: Subjective MDP
  • definition 2: Subjectively Best-Response Policy
  • definition 3: Long-Run KL Divergence
  • definition 4: Pseudo-True Parameter Set
  • definition 5: Infinite-Horizon Berk--Nash Solution
  • lemma 1: Properties of the pseudo-true parameter correspondence
  • proof
  • lemma 2: Properties of the best-response correspondence
  • proof
  • theorem 1: Existence of Infinite-Horizon BN Solution
  • ...and 11 more