A Mathematical Programming Approach to Computing and Learning Berk--Nash Equilibria in Infinite-Horizon MDPs

Quanyan Zhu; Zhengye Han

Abstract

We study sequential decision-making when the agent's internal model class is misspecified. Within the infinite-horizon Berk-Nash framework, stable behavior arises as a fixed point: the agent acts optimally relative to a subjective model, while that model is statistically consistent with the long-run data endogenously generated by the policy itself. We provide a rigorous characterization of this equilibrium via coupled linear programs and a bilevel optimization formulation. To address the intrinsic non-smoothness of standard best-response correspondences, we introduce entropy regularization, establishing the existence of a unique soft Bellman fixed point and a smooth objective. Exploiting this regularity, we develop an online learning scheme that casts model selection as an adversarial bandit problem using an EXP3-type update, augmented by a novel conjecture-set zooming mechanism that adaptively refines the parameter space. Numerical results demonstrate effective exploration-exploitation trade-offs, convergence to the KL-minimizing model, and sublinear regret.
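The soft Bellman fixed point mentioned above can be illustrated with a small sketch. It assumes the standard log-sum-exp form of entropy regularization with a temperature `tau`; the paper's own regularization parameter may be scaled differently. The function names and the two-state MDP (`P_demo`, `r_demo`) are hypothetical, chosen only for illustration.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def q_values(P, r, gamma, v):
    """Q[s][a] = r[s][a] + gamma * sum over s2 of P[s][a][s2] * v[s2]."""
    return [[r[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][a]))
             for a in range(len(P[s]))]
            for s in range(len(P))]

def soft_value_iteration(P, r, gamma, tau, tol=1e-10, max_iter=10_000):
    """Iterate the soft Bellman operator v(s) = tau * logsumexp(Q(s, .) / tau).

    Assumption: tau > 0 is a temperature (hypothetical name). The operator is
    a gamma-contraction in the sup norm, so the iteration converges to its
    unique fixed point.
    """
    v = [0.0] * len(P)
    for _ in range(max_iter):
        q = q_values(P, r, gamma, v)
        v_new = [tau * logsumexp([x / tau for x in row]) for row in q]
        gap = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if gap < tol:
            break
    # Softmax (Boltzmann) policy induced by the final Q-values.
    q = q_values(P, r, gamma, v)
    policy = []
    for row in q:
        scaled = [x / tau for x in row]
        lse = logsumexp(scaled)
        policy.append([math.exp(x - lse) for x in scaled])
    return v, policy

# Hypothetical 2-state, 2-action MDP: P_demo[s][a] is a distribution over next states.
P_demo = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
r_demo = [[1.0, 0.0], [0.0, 2.0]]

v_sharp, pi_sharp = soft_value_iteration(P_demo, r_demo, gamma=0.9, tau=0.01)
v_flat, pi_flat = soft_value_iteration(P_demo, r_demo, gamma=0.9, tau=100.0)
```

In this parameterization a small `tau` gives a nearly deterministic policy and a large `tau` gives a nearly uniform one, which is the qualitative transition that smoothing a best-response correspondence is meant to produce.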

Paper Structure (42 sections, 9 theorems, 20 equations, 6 figures, 2 algorithms)

Introduction
Main Contribution of This Work
Related Work
Organization of the Paper
Infinite-Horizon Berk--Nash Framework for a Finite Markov Decision Problem
True Markov Decision Process
Subjective Parametric Models
Subjective Optimality and the Best Response
Long-Run Statistical Consistency
Infinite-Horizon Berk--Nash Solution
Regularity Conditions
Regularity of $BR(\theta)$ and $\Theta^\ast(\pi)$
Existence of Infinite-Horizon Berk--Nash Solutions
LP Characterization of the Subjective Best Response
Primal and Dual Linear Programs
...and 27 more sections

Key Result

lemma 1: Properties of the pseudo-true parameter correspondence

Under Assumptions ass:primitives--ass:KL, for each $\pi\in\Sigma$:

Figures (6)

Figure 1: Empirical selection frequencies. The algorithm concentrates 82% of decisions on $\theta^1$.
Figure 2: Instantaneous loss $\widehat{J}_t$ and running average. Convergence to BN loss (0.012) observed.
Figure 3: Policy $\pi_{\theta^1,\lambda}(\cdot|0)$ vs. $\log_{10}\lambda$. Transition from uniform to deterministic.
Figure 4: Value $v_{\theta^1,\lambda}$ vs. $\log_{10}\lambda$. Values converge smoothly to the unregularized limit.
Figure 5: Selected $\epsilon_t$. The algorithm rapidly concentrates on the true model region ($\epsilon \approx 0$).
...and 1 more figures
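
The concentration of selection frequencies on one candidate model, as reported in Figure 1, is the kind of behavior an EXP3-type update produces when one arm has persistently lower loss. The sketch below is textbook EXP3 with uniform exploration mixing, not the paper's algorithm: it omits the conjecture-set zooming step, and the function name, step size `eta`, mixing weight `gamma_mix`, and the constant losses in the usage example are all assumptions made for illustration.

```python
import math
import random

def exp3(loss_fn, n_arms, horizon, eta, gamma_mix, rng):
    """Textbook EXP3 over n_arms candidate models (hypothetical interface).

    loss_fn(arm, t) must return a loss in [0, 1]. Only the loss of the
    selected arm is observed, so it is importance-weighted by 1 / probs[arm].
    """
    log_w = [0.0] * n_arms            # log-weights, kept in log space for stability
    counts = [0] * n_arms             # how often each arm was selected
    probs = [1.0 / n_arms] * n_arms
    for t in range(horizon):
        m = max(log_w)
        w = [math.exp(x - m) for x in log_w]
        total = sum(w)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1.0 - gamma_mix) * wi / total + gamma_mix / n_arms for wi in w]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        loss = loss_fn(arm, t)
        # Unbiased estimate of the loss vector: nonzero only for the played arm.
        log_w[arm] -= eta * loss / probs[arm]
        counts[arm] += 1
    return counts, probs

# Usage with made-up losses: arm 0 plays the role of the best-fitting model.
demo_rng = random.Random(0)
demo_counts, demo_probs = exp3(
    lambda arm, t: 0.1 if arm == 0 else 0.9,
    n_arms=3, horizon=2000, eta=0.05, gamma_mix=0.05, rng=demo_rng,
)
```

The mixing term keeps every selection probability at least `gamma_mix / n_arms`, which bounds the importance-weighted loss estimates; this is one common way to manage the exploration side of the trade-off.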

Theorems & Definitions (21)

definition 1: Subjective MDP
definition 2: Subjectively Best-Response Policy
definition 3: Long-Run KL Divergence
definition 4: Pseudo-True Parameter Set
definition 5: Infinite-Horizon Berk--Nash Solution
lemma 1: Properties of the pseudo-true parameter correspondence
proof
lemma 2: Properties of the best-response correspondence
proof
theorem 1: Existence of Infinite-Horizon BN Solution
...and 11 more
