Quantum-Inspired Reinforcement Learning in the Presence of Epistemic Ambivalence

Alireza Habibi, Saeed Ghoorchian, Setareh Maghsudi

TL;DR

The paper tackles decision-making under epistemic ambivalence (EA) by introducing EA-MDP, a quantum-inspired framework that encodes EA as a Hilbert-space state attached to classical environment states. Rewards are derived from quantum measurements using a reward operator, yielding an expectation over measurement outcomes, and the authors prove the existence of an optimal policy and value function, along with an EA-specific $\varepsilon$-greedy Q-learning algorithm. They validate the approach on two model problems: a two-site system and a many-site lattice, showing convergence to optimal policies and revealing how quantum interference from complex amplitudes modulates learning and path selection. The work demonstrates how EA can be explicitly modeled and controlled in RL through EA bases and measurement settings, offering a scalable, tunable mechanism to handle persistent conflicting evidence in online decision problems. Future work includes time-dependent EA states, richer outcome sets, and partial observability to broaden applicability and realism in real-world settings.
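As a rough illustration of how a reward can arise as an expectation over measurement outcomes, the sketch below builds a reward operator $R = \sum_k r_k P_k$ from projectors $P_k$ and evaluates $\langle \psi | R | \psi \rangle$. It is not the authors' code. The helper names (`projector`, `outcome_probability`, `expected_reward`) are hypothetical, the amplitudes and outcome rewards are borrowed from the Figure 1 caption, and the choice of outcomes (first the basis vectors themselves, then a projector onto $(e_0 + e_1)/\sqrt{2}$) is an assumption made only to show the effect of a relative phase.

```python
import numpy as np

def projector(vector):
    """Rank-one projector |v><v| onto the normalized vector v."""
    v = np.asarray(vector, dtype=complex)
    v = v / np.linalg.norm(v)
    return np.outer(v, v.conj())

def outcome_probability(amplitudes, proj):
    """Born rule: p = <psi| P |psi> for the normalized state psi."""
    psi = np.asarray(amplitudes, dtype=complex)
    psi = psi / np.linalg.norm(psi)
    return float(np.real(np.vdot(psi, proj @ psi)))

def expected_reward(amplitudes, projectors, outcome_rewards):
    """Expectation of the reward operator R = sum_k r_k P_k."""
    return sum(r * outcome_probability(amplitudes, p)
               for r, p in zip(outcome_rewards, projectors))

# Assumption: each outcome coincides with one basis vector (no interference).
basis = [projector(e) for e in np.eye(3)]
rewards = (-1.0, 1.0, 2.0)
c1 = (2 / 3, 2 / 3, 1 / 3)
print("expected reward:", expected_reward(c1, basis, rewards))

# Hypothetical outcome that mixes two basis vectors: its probability
# depends on the relative phase theta between the first two amplitudes.
plus = projector([1, 1, 0])
for theta in (0.0, np.pi):
    c = (2 / 3, 2 / 3 * np.exp(1j * theta), 1 / 3)
    print("theta =", theta, "p(plus) =", outcome_probability(c, plus))
```

Under these assumptions the expected reward is $2/9$. The probability of the mixed outcome is $8/9$ at $\theta = 0$ and vanishes at $\theta = \pi$, a toy version of the phase-dependent interference the summary refers to.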

Abstract

The complexity of online decision-making under uncertainty stems from the requirement of finding a balance between exploiting known strategies and exploring new possibilities. Naturally, the uncertainty type plays a crucial role in developing decision-making strategies that manage complexity effectively. In this paper, we focus on a specific form of uncertainty known as epistemic ambivalence (EA), which emerges from conflicting pieces of evidence or contradictory experiences. It creates a delicate interplay between uncertainty and confidence, distinguishing it from epistemic uncertainty that typically diminishes with new information. Indeed, ambivalence can persist even after additional knowledge is acquired. To address this phenomenon, we propose a novel framework, called the epistemically ambivalent Markov decision process (EA-MDP), aiming to understand and control EA in decision-making processes. This framework incorporates the concept of a quantum state from the quantum mechanics formalism, and its core is to assess the probability and reward of every possible outcome. We calculate the reward function using quantum measurement techniques and prove the existence of an optimal policy and an optimal value function in the EA-MDP framework. We also propose the EA-epsilon-greedy Q-learning algorithm. To evaluate the impact of EA on decision-making and the expedience of our framework, we study two distinct experimental setups, namely the two-state problem and the lattice problem. Our results show that using our methods, the agent converges to the optimal policy in the presence of EA.
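To make the learning loop concrete, here is a minimal sketch of standard tabular $\varepsilon$-greedy Q-learning on a hypothetical two-site environment. It is not the authors' EA-epsilon-greedy algorithm, whose details are not given in this summary. Assumptions: the agent either stays or moves to the other site, transitions are deterministic, and the reward on arrival is the Born-rule expected reward of the destination site's EA state, measured in the same basis as its amplitudes. The amplitudes, outcome rewards, and $\gamma = 0.8$ are taken from the Figure 1 caption; all function names and the values of `alpha`, `epsilon`, and `steps` are illustrative.

```python
import numpy as np

def born_expected_reward(amplitudes, outcome_rewards):
    """Expected reward when outcome k occurs with probability |c_k|^2."""
    probs = np.abs(np.asarray(amplitudes, dtype=complex)) ** 2
    probs = probs / probs.sum()
    return float(probs @ np.asarray(outcome_rewards, dtype=float))

def q_learning_two_sites(site_rewards, gamma=0.8, alpha=0.2,
                         epsilon=0.2, steps=50_000, seed=0):
    """Tabular epsilon-greedy Q-learning; action 0 = stay, action 1 = move."""
    rng = np.random.default_rng(seed)
    q = np.zeros((2, 2))  # q[state, action]
    state = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(2))      # explore uniformly
        else:
            action = int(np.argmax(q[state]))  # exploit current estimate
        next_state = state if action == 0 else 1 - state
        reward = site_rewards[next_state]      # hypothetical reward rule
        target = reward + gamma * q[next_state].max()
        q[state, action] += alpha * (target - q[state, action])
        state = next_state
    return q

outcome_rewards = (-1.0, 1.0, 2.0)
c1 = (2 / 3, 2 / 3, 1 / 3)
c2 = (2 / 3, 1 / 3, 2 / 3)
site_rewards = [born_expected_reward(c, outcome_rewards) for c in (c1, c2)]
q = q_learning_two_sites(site_rewards)
print("greedy policy per site:", q.argmax(axis=1))
```

In this toy setting the expected rewards are $2/9$ for the first site and $5/9$ for the second, so the greedy policy learned from `q` moves away from the first site and stays at the second.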

Paper Structure

This paper contains 19 sections, 6 theorems, 52 equations, 6 figures, 1 algorithm.

Key Result

Theorem 5.1

In an EA-MDP, denoted by ${\tilde{\mathcal{M}} = \langle \tilde{\mathcal{S}}, \tilde{\mathcal{A}}, r, p, \gamma \rangle}$, with a fixed stochastic policy $\pi: \tilde{\mathcal{S}} \rightarrow \Delta(\tilde{\mathcal{A}})$ and a fixed $\tilde{\boldsymbol{s}}(\cdot)$, we have

Figures (6)

  • Figure 1: The optimal value function for the two-site example in the presence of EA with parameters ${\tilde{\boldsymbol{r}} = (-1, 1, \tilde{r}(\omega^{(\text{EA})}_{2}))}$, ${\boldsymbol c_1 = (\frac{2}{3}, \frac{2}{3}, \frac{1}{3})}$, and ${\boldsymbol c_2=(\frac{2}{3},\frac{1}{3},\frac{2}{3})}$. The set of outcomes is shown in Equation (\ref{eq:example1-outcomeset}). (a) $\tilde{r}(\omega^{(\text{EA})}_{2})=2$ and different values of $\gamma$, (b) $\gamma=0.8$ and different values of $\tilde{r}(\omega^{(\text{EA})}_{2})$.
  • Figure 2: (a) A $5 \times 5$ lattice. Color combinations represent the sites with different EA quantum states. (b) The set of EA bases shown with different colors.
  • Figure 3: The optimal value function in the lattice, given an EA with outcome rewards ${\tilde{\boldsymbol{r}} = (-1, -2, -3, 1)}$, a discount factor $\gamma = 0.9$, probability amplitudes in Equation (\ref{eq:latticeea}), and the set of outcomes in Equation (\ref{eq:lattice_outcome}). The effects of varying (a) $\phi_1$ and (b) $\phi_2$ on the optimal value function are shown.
  • Figure 4: The optimal value function for the two-site example in the presence of EA, computed with parameters ${\tilde{\boldsymbol{r}} = (-1, 1, 2)}$, ${\boldsymbol c_1 = (\frac{2}{3}, \frac{2}{3} e^{i \theta_1}, \frac{1}{3})}$, and ${\boldsymbol c_2=(\frac{2}{3},\frac{1}{3} e^{i \theta_2},\frac{2}{3})}$. The set of outcomes is defined in Equation (\ref{eq:example1-outcomeset}). (a) $V^{\tilde{\boldsymbol{s}}*}({s_1})$, (b) $V^{\tilde{\boldsymbol{s}}*}({s_2})$ for different values of $\theta_1$ and $\theta_2$.
  • Figure 5: The optimal value function in the lattice with EA, computed using outcome rewards ${\tilde{\boldsymbol{r}} = (-1, -2, -3, \tilde{r}(\omega^{(\text{EA})}_3))}$ and probability amplitudes as presented in Equation (\ref{eq:latticeea}). The set of outcomes is detailed in Equation (\ref{eq:lattice_outcome}), with $\phi_1=\phi_2=0$. The optimal value function is shown in two cases: (a) for different values of $\gamma$ and (b) for different values of $\tilde{r}(\omega^{(\text{EA})}_3)$. At the transition point, the trajectory that maximizes rewards shifts to a new one.
  • ...and 1 more figure

Theorems & Definitions (10)

  • Definition 4.3
  • Theorem 5.1: Bellman equations in EA-MDP
  • proof
  • Theorem 5.2: Bellman contraction for EA-MDP
  • proof
  • Theorem 5.3: Existence of an optimal value function and optimal policy in EA-MDP
  • proof
  • Theorem 3.1
  • Theorem 4.2
  • Theorem 4.3