Quantum framework for Reinforcement Learning: Integrating Markov decision process, quantum arithmetic, and trajectory search

Thet Htar Su, Shaswot Shresthamali, Masaaki Kondo

TL;DR

This work targets the computational bottlenecks of classical reinforcement learning with a fully quantum framework that encodes an MDP in quantum registers and performs agent–environment interaction, return computation, and trajectory search entirely in the quantum domain. The method comprises a quantum representation of the state set $S$ and action set $A$, quantum state transitions implemented as $R_y$ rotations conditioned on state–action pairs, quantum return aggregation, and a Grover-based trajectory search that identifies high-return paths with a single oracle call. Demonstrations on a four-state, two-action MDP show that the quantum model reproduces the classical dynamics and that Grover's search recovers optimal trajectories in agreement with classical Q-learning results, with indications of improved sample efficiency and speed. The findings suggest a viable path toward quantum-native RL with potential impact on autonomous systems, healthcare, and finance, while highlighting two practical challenges: scaling quantum resources and searching when the target return is unknown.
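The role of the $R_y$ rotations can be illustrated with a small sketch. Since $R_y(\theta)\ket{0} = \cos(\theta/2)\ket{0} + \sin(\theta/2)\ket{1}$, measuring $\ket{1}$ has probability $\sin^2(\theta/2)$, so a transition probability $p$ corresponds to $\theta = 2\arcsin(\sqrt{p})$. The code below assumes a single next-state qubit and uses hypothetical probabilities (`transition_p`, 0.8 and 0.3 are illustrative, not values from the paper); it is plain Python arithmetic, not the authors' circuit.

```python
import math


def ry_angle(p_one):
    """Return theta such that Ry(theta)|0> yields |1> with probability p_one."""
    if not 0.0 <= p_one <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p_one))


def apply_ry_to_zero(theta):
    """Amplitudes (a0, a1) of Ry(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>."""
    return (math.cos(theta / 2.0), math.sin(theta / 2.0))


# Hypothetical P(next-state qubit = 1 | state, action); not taken from the paper.
transition_p = {("s0", "a0"): 0.8, ("s0", "a1"): 0.3}

for (state, action), p in transition_p.items():
    theta = ry_angle(p)
    amp0, amp1 = apply_ry_to_zero(theta)
    # amp1**2 recovers the transition probability encoded by the rotation.
    print(state, action, round(theta, 4), round(amp1 ** 2, 4))
```

In a circuit, each such rotation would be controlled on the qubits that hold the state–action pair, so that every pair applies its own angle to the next-state register.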

Abstract

This paper introduces a quantum framework for addressing reinforcement learning (RL) tasks, grounded in quantum principles and leveraging a fully quantum model of the classical Markov decision process (MDP). By employing quantum concepts and a quantum search algorithm, this work presents the implementation and optimization of agent-environment interactions entirely within the quantum domain, eliminating reliance on classical computations. Key contributions include quantum-based state transitions, return calculation, and a trajectory search mechanism, which together demonstrate how RL processes can be realized through quantum phenomena. The implementation emphasizes the fundamental role of quantum superposition in enhancing computational efficiency for RL tasks. Results demonstrate the capacity of a quantum model to achieve quantum enhancement in RL, highlighting the potential of fully quantum implementations in decision-making tasks. This work not only underscores the applicability of quantum computing in machine learning but also contributes to the field of quantum reinforcement learning (QRL) by offering a robust framework for understanding and exploiting quantum computing in RL systems.

Paper Structure

This paper contains 32 sections, 16 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: The agent-environment interaction in a Markov decision process (MDP) (Sutton and Barto, 2018).
  • Figure 2: Quantum circuit for Grover's algorithm on 2 qubits, searching for the target state $\ket{11}$. The circuit was generated with Qiskit (Qiskit, 2024). The measurement was performed on an IBM quantum processor (ibm_brisbane, version 1.1.62, processor type Eagle r3, 127 qubits). The output distribution is displayed on the right, showing the searched state $\ket{11}$ with the highest count.
  • Figure 3: Graphical representation of a classical MDP with four states $(s_0,s_1,s_2,s_3)$, two actions $(a_0, a_1)$ and rewards $(r_0, r_1, r_2, r_3)$. The arrows between states indicate the transitions associated with each action. Transition probabilities are labeled on each arrow.
  • Figure 4: Quantum circuit of the quantum Markov decision process (QMDP) simulating a single interaction between the agent and the environment. The circuit encodes states and actions into qubits, allowing the agent to explore multiple states in superposition. $R_y(\theta)$ gates represent probabilistic state transitions based on the environment’s response to the agent's actions, while CNOT gates implement the reward mechanism, conditioned on the resulting states.
  • Figure 5: State transition heat-map representing the probabilities of transition from each state-action pair (on the x-axis) to the next state (on the y-axis) within a single agent-environment interaction in QMDP. Darker cells indicate higher transition probabilities.
  • ...and 10 more figures
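The two-qubit Grover search described in the Figure 2 caption can be reproduced with an idealized statevector calculation. The sketch below is a plain-Python, noise-free illustration (the function name `grover_two_qubits` is hypothetical), not the Qiskit circuit or the hardware run from the paper. It applies one oracle call and one diffusion step to a uniform superposition over four basis states.

```python
import math


def grover_two_qubits(target):
    """Run one Grover iteration on 2 qubits; return measurement probabilities."""
    n_states = 4
    # Hadamard on both qubits: uniform superposition over |00>, |01>, |10>, |11>.
    amps = [1.0 / math.sqrt(n_states)] * n_states
    # Oracle: flip the phase of the target basis state.
    amps[target] = -amps[target]
    # Diffusion operator: reflect every amplitude about the mean amplitude.
    mean = sum(amps) / n_states
    amps = [2.0 * mean - a for a in amps]
    return [a * a for a in amps]


probs = grover_two_qubits(target=0b11)
for index, p in enumerate(probs):
    print(f"|{index:02b}>: {p:.3f}")
```

With four basis states and one marked state, a single iteration moves all of the probability onto $\ket{11}$ in this ideal setting. A run on real hardware, as in Figure 2, would be expected to show $\ket{11}$ only as the most frequent outcome because of noise.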