Using Quantum Solved Deep Boltzmann Machines to Increase the Data Efficiency of RL Agents

Daniel Kent, Clement O'Rourke, Jake Southall, Kirsty Duncan, Adrian Bedford

TL;DR

The paper tackles data-efficient reinforcement learning in cyber-defense contexts where training data is scarce. It replaces SB3 PPO's policy and value networks with quantum-hybrid Deep Boltzmann Machines trained via the D-Wave quantum annealer, integrating into the PPO framework and enabling sampling-based energy estimates for $V_{\pi}(s)$ and $Q_{\pi}(s,a)$. Evaluated on PrimAITE with a six-node network and a large action space, the approach achieves approximately a 2× reduction in data requirements while maintaining accuracy, albeit with significantly longer wall-clock training times. This demonstrates the potential of quantum-assisted RL to reduce data needs in real-world security environments, while highlighting hardware costs and trade-offs and outlining future work on statistical data-efficiency, transfer learning, and multi-agent extensions.

Abstract

Deep Learning algorithms, such as those used in Reinforcement Learning, often require large quantities of data to train effectively. In most cases, the availability of data is not a significant issue. However, for some contexts, such as autonomous cyber defence, we require data-efficient methods. Recently, Quantum Machine Learning and Boltzmann Machines have been proposed as solutions to this challenge. In this work we build upon pre-existing work to extend the use of Deep Boltzmann Machines to the cutting-edge algorithm Proximal Policy Optimisation in a Reinforcement Learning cyber defence environment. We show that this approach, when solved using a D-WAVE quantum annealer, can lead to a two-fold increase in data efficiency. We therefore expect it to be used by the machine learning and quantum communities who are hoping to capitalise on data-efficient Reinforcement Learning methods.

Paper Structure

This paper contains 8 sections, 3 equations, 7 figures.

Figures (7)

  • Figure 1: Visualisation of how actions are chosen by the policy network, evaluated by the environment, and then used as inputs to the value network. This gives a value function estimate for a given action, observation and reward. In this work we use DBMs for both the policy and value networks.
  • Figure 2: Visualisation of the connections between units in a Boltzmann Machine. The output of this machine is the energy defined by Equation \ref{eq:bm}.
  • Figure 3: Visualisation of the connections between units in a clamped Boltzmann Machine. The visible units are clamped to either $0$ or $1$, representing a fixed input and output, but the hidden units can vary. In this work, the output of this machine is the energy defined by Equation \ref{eq:free_energy}.
  • Figure 4: Visualisation of the connections between units in a DBM. In this work, the output of this machine is the energy defined by Equation \ref{eq:free_energy}.
  • Figure 5: PrimAITE's pre-packaged six-node network that we used in this work. Red agent attacks originated in either of the PC network endpoints and had the ultimate goal of sabotaging the server endpoint.
  • ...and 2 more figures
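
The captions for Figures 2 to 4 refer to the energy of a Boltzmann Machine and the free energy of a clamped one. The paper's own equations are not reproduced on this page, so the following is only a minimal sketch under stated assumptions: it uses the textbook energy $E(v,h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} w_{ij} v_i h_j - \sum_{j<k} u_{jk} h_j h_k$ and the free energy $F(v) = -\frac{1}{\beta}\log\sum_h e^{-\beta E(v,h)}$, with the visible units clamped and the hidden units summed over. All names (`energy`, `free_energy`, `w_vh`, `w_hh`, `beta`) are hypothetical, and the sum over hidden states is done by exact enumeration, which stands in for the quantum annealer sampling described in the paper and is only feasible for a handful of hidden units.

```python
import itertools
import math


def energy(v, h, w_vh, w_hh, b_v, b_h):
    """Energy of one joint binary configuration (v, h).

    Assumed textbook form, not taken from the paper:
    visible biases, hidden biases, visible-hidden couplings,
    and hidden-hidden couplings (upper triangle of w_hh only).
    """
    e = 0.0
    for i, vi in enumerate(v):
        e -= b_v[i] * vi
        for j, hj in enumerate(h):
            e -= w_vh[i][j] * vi * hj
    for j, hj in enumerate(h):
        e -= b_h[j] * hj
        for k in range(j + 1, len(h)):
            e -= w_hh[j][k] * hj * h[k]
    return e


def free_energy(v, w_vh, w_hh, b_v, b_h, beta=1.0):
    """Free energy of the hidden units with the visible units clamped to v.

    Enumerates all 2**n_hidden hidden configurations exactly; an annealer
    would instead estimate this quantity from samples.
    """
    n_hidden = len(b_h)
    energies = [
        energy(v, h, w_vh, w_hh, b_v, b_h)
        for h in itertools.product((0, 1), repeat=n_hidden)
    ]
    # Log-sum-exp, shifted by the minimum energy for numerical stability.
    e_min = min(energies)
    z = sum(math.exp(-beta * (e - e_min)) for e in energies)
    return e_min - math.log(z) / beta


# Toy example: two clamped visible units, two free hidden units.
w_vh = [[0.5, -0.2], [0.1, 0.3]]   # visible-hidden couplings (illustrative values)
w_hh = [[0.0, 0.4], [0.0, 0.0]]    # hidden-hidden couplings (upper triangle)
b_v = [0.0, 0.1]
b_h = [-0.1, 0.2]

clamped = [1, 0]                    # fixed input/output pattern
f = free_energy(clamped, w_vh, w_hh, b_v, b_h)
print(f"free energy of clamped configuration: {f:.4f}")
```

In free-energy-based reinforcement learning, the negative of this free energy is commonly used as the value estimate for the clamped state and action; whether the paper uses exactly that sign and scaling convention is an assumption here.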