Optimizing Variational Quantum Circuits Using Metaheuristic Strategies in Reinforcement Learning

Michael Kölle, Daniel Seidl, Maximilian Zorn, Philipp Altmann, Jonas Stein, Thomas Gabor

TL;DR

This work tackles the challenge of optimizing variational quantum circuits (VQCs) within Quantum Reinforcement Learning (QRL) by employing gradient-free metaheuristics to overcome flat landscapes and vanishing gradients. It systematically evaluates Simulated Annealing (SA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Tabu Search (TS), and Harmony Search (HS) against a Genetic Algorithm (GA) baseline in two RL tasks, $5\times5$ MiniGrid and Cart Pole, using VQCs with MPS or amplitude encoding. The results show that PSO and SA consistently deliver strong performance, with PSO offering the fastest learning and highest stability across environments, while GA demonstrates strong potential given longer runtimes; ACO, HS, and TS generally underperform or show lower robustness. These findings highlight the practical potential of gradient-free metaheuristics for efficient QRL and underscore the importance of careful algorithm selection and adaptation to the problem setting, with future work pointed toward more complex tasks, adaptive hyperparameters, and hardware experiments.

Abstract

Quantum Reinforcement Learning (QRL) offers potential advantages over classical Reinforcement Learning, such as compact state space representation and faster convergence in certain scenarios. However, practical benefits require further validation. QRL faces challenges like flat solution landscapes, where traditional gradient-based methods are inefficient, necessitating the use of gradient-free algorithms. This work explores the integration of metaheuristic algorithms -- Particle Swarm Optimization, Ant Colony Optimization, Tabu Search, Genetic Algorithm, Simulated Annealing, and Harmony Search -- into QRL. These algorithms provide flexibility and efficiency in parameter optimization. Evaluations in $5\times5$ MiniGrid Reinforcement Learning environments show that all algorithms yield near-optimal results, with Simulated Annealing and Particle Swarm Optimization performing best. In the Cart Pole environment, Simulated Annealing, Genetic Algorithms, and Particle Swarm Optimization achieve optimal results, while the others perform only slightly better than random action selection. These findings demonstrate the potential of Particle Swarm Optimization and Simulated Annealing for efficient QRL learning, emphasizing the need for careful algorithm selection and adaptation.
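To make the idea of gradient-free parameter optimization concrete, the following is a minimal sketch of Simulated Annealing over a vector of circuit parameters. It is not the authors' implementation: `toy_reward` is a hypothetical stand-in for the episodic return of a VQC-based agent, and the cooling schedule, step size, and iteration count are illustrative assumptions.

```python
import math
import random


def toy_reward(params):
    """Hypothetical stand-in for an episode return; maximal (0) when every parameter is 0.5."""
    return -sum((p - 0.5) ** 2 for p in params)


def simulated_annealing(reward_fn, n_params, n_iters=2000,
                        t_start=1.0, t_end=1e-3, step=0.1, seed=0):
    """Maximize reward_fn without gradients; returns (best_params, best_reward)."""
    rng = random.Random(seed)
    # Random initial rotation angles (assumed range, for illustration only).
    current = [rng.uniform(-math.pi, math.pi) for _ in range(n_params)]
    current_r = reward_fn(current)
    best, best_r = list(current), current_r

    for i in range(n_iters):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (i / max(1, n_iters - 1))
        # Propose a neighbor by perturbing every parameter with Gaussian noise.
        candidate = [p + rng.gauss(0.0, step) for p in current]
        cand_r = reward_fn(candidate)
        delta = cand_r - current_r
        # Always accept improvements; accept a worse candidate with
        # probability exp(delta / temp), which shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_r = candidate, cand_r
            if current_r > best_r:
                best, best_r = list(current), current_r
    return best, best_r


best_params, best_reward = simulated_annealing(toy_reward, n_params=8)
```

Only reward evaluations are needed, so a flat or noisy landscape does not stall the search the way a vanishing gradient would. In a QRL setting, `reward_fn` would run one or more episodes with the circuit parameters set to `params`.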

Paper Structure (27 sections, 9 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Quantum Circuit for $5\times5$ MiniGrid Environment. The $5\times5$ MiniGrid input is reduced to a dimension of 8 using MPS. Eight qubits are initialized and put into superposition with Hadamard gates. The 8-dimensional input is encoded using $R_Y$ and $R_Z$ rotational gates. The variational part involves entangling all qubits and applying rotation gates with trainable parameters $\alpha_i$, $\beta_i$, and $\gamma_i$. Measurements on 6 of the 8 qubits correspond to the $|\mathcal{A}| = 6$ available actions, with action probabilities derived using a softmax function (see Chen et al., 2022).
  • Figure 2: Quantum Circuit for Cart Pole Environment, following Chen et al. (2022). The $U(x)$ gate represents the encoding, here realized using Amplitude Encoding. The variational part consists of a CNOT gate, followed by parameterized rotational gates with trainable parameters $\alpha_i$, $\beta_i$, and $\gamma_i$. The variational part is repeated four times. The output is a 2-dimensional tuple $[a,b]$, determining the cart movement direction.
  • Figure 3: Results of each metaheuristic in optimizing the RL agent in the $5\times5$ MiniGrid and Cart Pole environments. To ensure a fair comparison, the number of iterations was adjusted so that all algorithms have equal runtime. The plots show the highest reward achieved (y-axis) over wall-clock runtime (x-axis). Each metaheuristic was run 5 times, and the results are shown as the mean with a 95% confidence interval.
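
The aggregation described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' plotting code: it assumes the 5 runs are sampled on a common time grid and uses a Student-t interval, whereas the source does not say how the 95% confidence interval was computed. The function names and reward values are made up for the example.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 runs).
T_CRIT_95_DF4 = 2.776


def best_so_far(rewards):
    """Turn a per-step reward trace into a 'highest reward achieved so far' curve."""
    best = -math.inf
    curve = []
    for r in rewards:
        best = max(best, r)
        curve.append(best)
    return curve


def mean_and_ci(runs, t_crit=T_CRIT_95_DF4):
    """Per time step: (mean, lower, upper) across runs of the best-so-far curves."""
    curves = [best_so_far(r) for r in runs]
    summary = []
    for values in zip(*curves):
        m = statistics.mean(values)
        # Half-width of the interval: t * sample standard deviation / sqrt(n).
        half = t_crit * statistics.stdev(values) / math.sqrt(len(values))
        summary.append((m, m - half, m + half))
    return summary


# Made-up reward traces for 5 runs of one metaheuristic (illustration only).
runs = [
    [0.1, 0.3, 0.2, 0.8],
    [0.0, 0.4, 0.6, 0.5],
    [0.2, 0.2, 0.7, 0.9],
    [0.1, 0.5, 0.4, 0.7],
    [0.3, 0.1, 0.6, 0.8],
]
summary = mean_and_ci(runs)
```

Taking the running maximum before averaging means each curve is non-decreasing, which matches a "highest reward achieved" plot rather than a raw per-iteration reward plot.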