VQC-Based Reinforcement Learning with Data Re-uploading: Performance and Trainability

Rodrigo Coelho, André Sequeira, Luís Paulo Santos

TL;DR

Results indicate that the VQC-based Deep Q-Learning models may still be able to find large gradients throughout training, allowing for learning, even if the probability of being initialized in a Barren Plateau increases exponentially with system size for Hardware-Efficient ansatzes.

Abstract

Reinforcement Learning (RL) consists of designing agents that make intelligent decisions without human supervision. When used alongside function approximators such as Neural Networks (NNs), RL is capable of solving extremely complex problems. Deep Q-Learning, an RL algorithm that uses Deep NNs, has achieved super-human performance in some specific tasks. Nonetheless, it is also possible to use Variational Quantum Circuits (VQCs) as function approximators in RL algorithms. This work empirically studies the performance and trainability of such VQC-based Deep Q-Learning models in classic control benchmark environments. More specifically, we investigate how data re-uploading affects both metrics. We show that the magnitude and the variance of the gradients of these models remain substantial throughout training due to the moving targets of Deep Q-Learning. Moreover, we empirically show that increasing the number of qubits does not lead to an exponential vanishing of the magnitude and variance of the gradients for a Parameterized Quantum Circuit (PQC) approximating a 2-design, contrary to what the Barren Plateau phenomenon would predict. This hints at the possibility that VQCs are especially well suited for use as function approximators in such a context.
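
The "moving targets" mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard Deep Q-Learning setup, in which a separate target network produces the bootstrap targets and is synchronized with the online network every $C$ steps. The function names, array shapes, and discount factor below are hypothetical and are not taken from the paper.

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """Bootstrap targets y = r + gamma * max_a' Q_target(s', a').

    rewards:        shape (batch,)
    next_q_target:  shape (batch, n_actions), Q-values from the target network
    dones:          shape (batch,), 1.0 where the episode ended, else 0.0
    """
    # Terminal transitions do not bootstrap from the next state.
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

def maybe_sync_target(step, C, online_params, target_params):
    """Copy the online parameters into the target network every C steps."""
    if step % C == 0:
        return online_params.copy()
    return target_params
```

Because the targets are recomputed from parameters that keep changing, the regression problem the model solves is itself non-stationary, which is the mechanism the abstract credits for gradients staying substantial during training.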

Paper Structure (28 sections, 13 equations, 11 figures, 7 tables, 2 algorithms)

Figures (11)

  • Figure 1: The Agent-Environment Interface: The agent interacts with the environment at time step $t$ by taking action $A_t$. The environment then changes to state $S_{t+1}$ and produces reward $R_{t+1}$, which are both passed back to the agent so that it can decide the next action. The dotted lines indicate that this process repeats itself. Inspired by \cite{sutton2018reinforcement}.
  • Figure 2: Subfigure \ref{fig:skolik}: Skolik's Architecture. When Data Re-Uploading is used, the whole circuit is repeated in each layer. Otherwise, only the part that is not surrounded by dashed lines is repeated. Subfigure \ref{fig:uqc}: UQC Architecture. Each processing layer $U$ is given by $U^{UAT}(\overrightarrow{x};\overrightarrow{\omega},\alpha,\varphi) = R_y(2\varphi)R_z(2\overrightarrow{\omega}\cdot\overrightarrow{x}+2\alpha)$ and $\overrightarrow{\theta}_i = (\overrightarrow{\omega} , \alpha, \varphi)$. Although a single-qubit ansatz is shown for simplicity, this ansatz can be generalized to multiple qubits.
  • Figure 3: Performance of Baseline Models (on the left) and Data Re-Uploading models (on the right) in the CartPole-v0 environment (see Subfigure \ref{fig:skolik_cartpole}) and the Acrobot-v1 environment (see Subfigure \ref{fig:skolik_acrobot}) with and without trainable input and/or output scaling. The returns are averaged over $10$ agents. The full set of hyperparameters can be seen in Table \ref{table: skolik_hyper}.
  • Figure 4: Trainability of the models from Figures \ref{fig:skolik_cartpole} and \ref{fig:skolik_acrobot} with trainable output scaling in the CartPole-v0 (see Subfigure \ref{fig:skolik_gradients_cartpole}) and Acrobot-v1 (see Subfigure \ref{fig:skolik_gradients_acrobot}) environments. In both Subfigures, the left graph represents the gradient's norm throughout training and the right graph the variance of the norm.
  • Figure 5: Performance (first graph) and the respective loss functions for increasing values of $C$ of data re-uploading models in the CartPole-v0 (see Subfigure \ref{fig:target_loss_cartpole}) and Acrobot-v1 (see Subfigure \ref{fig:target_loss_acrobot}) environments. The full set of hyperparameters can be seen in Table \ref{table:target_loss}.
  • ...and 6 more figures
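
The single-qubit UQC layer from the Figure 2 caption, $U^{UAT} = R_y(2\varphi)R_z(2\overrightarrow{\omega}\cdot\overrightarrow{x}+2\alpha)$, can be sketched directly with NumPy. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the common convention $R_a(\theta) = e^{-i\theta\sigma_a/2}$, the initial state $|0\rangle$, and a Pauli-$Z$ expectation value as the readout. The function names are hypothetical.

```python
import numpy as np

def rz(theta):
    # R_z(theta) = exp(-i * theta * Z / 2) (assumed convention)
    return np.array([[np.exp(-1j * theta / 2), 0.0],
                     [0.0, np.exp(1j * theta / 2)]])

def ry(theta):
    # R_y(theta) = exp(-i * theta * Y / 2) (assumed convention)
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def uqc_layer(x, omega, alpha, phi):
    """One processing layer: R_y(2*phi) R_z(2*omega.x + 2*alpha)."""
    return ry(2 * phi) @ rz(2 * np.dot(omega, x) + 2 * alpha)

def uqc_expectation_z(x, thetas):
    """Apply the layers in sequence to |0> and return <Z>.

    thetas is a list of (omega, alpha, phi) tuples, one per layer.
    The same input x enters every layer, which is data re-uploading.
    """
    state = np.array([1.0, 0.0], dtype=complex)
    for omega, alpha, phi in thetas:
        state = uqc_layer(x, omega, alpha, phi) @ state
    z = np.array([[1.0, 0.0], [0.0, -1.0]], dtype=complex)
    return float(np.real(np.vdot(state, z @ state)))
```

Each layer has its own trainable tuple $(\overrightarrow{\omega}, \alpha, \varphi)$, so stacking more layers adds parameters while re-encoding the same input each time.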