Reinforcement Learning for Control of Non-Markovian Cellular Population Dynamics

Josiah C. Kratz, Jacob Adamczyk

TL;DR

This work addresses controlling non-Markovian cellular populations with memory effects to suppress proliferation under drug dosing, formalized through a memory kernel with memory strength $\mu\in(0,1]$ and a final objective $C=\log\left(N(T)/N(0)\right)$, where $N(t)=S(t)+R(t)$. It introduces a memory-enabled two-state model with susceptible $S$ and resistant $R$ cells and proves that the optimal control is bang-bang under monotone dose–response, guiding the use of end-to-end reinforcement learning to discover effective dosing policies. The authors demonstrate that model-free deep RL can recover the memoryless optimal policy and, when memory is present, learn robust, memory-aware dosing strategies even under observation noise, outperforming baselines. They further enhance generalization with domain randomization over memory strength and distributional RL (FQF) to cope with uncertain memory and noise, achieving strong performance across scenarios. The results highlight the potential for RL-guided adaptive dosing in clinical contexts, offering practical, bang-bang policies that remain effective despite non-Markovian dynamics and measurement perturbations.

Abstract

Many organisms and cell types, from bacteria to cancer cells, exhibit a remarkable ability to adapt to fluctuating environments. Additionally, cells can leverage a memory of past environments to better survive previously-encountered stressors. From a control perspective, this adaptability poses significant challenges in driving cell populations toward extinction, and thus poses an open question with great clinical significance. In this work, we focus on drug dosing in cell populations exhibiting phenotypic plasticity. For specific dynamical models switching between resistant and susceptible states, exact solutions are known. However, when the underlying system parameters are unknown, and for complex memory-based systems, obtaining the optimal solution is currently intractable. To address this challenge, we apply reinforcement learning (RL) to identify informed dosing strategies to control cell populations evolving under novel non-Markovian dynamics. We find that model-free deep RL is able to recover exact solutions and control cell populations even in the presence of long-range temporal dynamics. To further test our approach in more realistic settings, we demonstrate robust RL-based control strategies in environments with measurement noise and dynamic memory strength.

Paper Structure

This paper contains 16 sections, 18 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Depiction of the deterministic phenotypic switching model. The susceptible subpopulation $S$ transitions to the resistant state $R$ at a concentration-dependent rate, $\delta$. Similarly, the resistant state switches back to a susceptible state at a rate $\alpha$. The subpopulations each have a concentration-dependent growth or death rate, $\kappa$.
  • Figure 2: Left: The learned policy resembles the optimal memoryless strategy, with an initial constant application phase followed by a pulsatile phase. However, in the case of memory-based dynamics, the frequency of pulsing must be increased over time, as discussed in Sec. \ref{sec:results}. Since the policy is eventually limited by the simulation time, the pulsing frequency becomes bottlenecked by our choice of time discretization $\Delta$ after $\approx 20$ hours. Despite this, the policy is still able to perform well with rapid pulsing. Right: Effect of the learned policy on the resistant fraction for different memory strengths. The RL agent finds (for distinct $\mu$ values) appropriate lower and upper bounds for the fraction of resistant cells. Maintaining the subpopulation in this range ensures the population can be controlled.
  • Figure 3: Performance comparison of constant drug application, the solution for the memoryless case, a resistant fraction-based pulsing technique, and the policy learned by RL. For the fraction-based policy, optimal lower and upper bounds for the resistant fraction are found through sweeping (Appendix \ref{sec:memoryless}). The RL policy controls the cell population better than any of the other schemes.
  • Figure 4: PPO and SAC fail to find a bang-bang control policy and perform worse than DQN, highlighting the need for discrete-action algorithms, as informed by optimal control.
  • Figure 5: Left: The policy is robust to observation noise, as tested in a memoryless environment. To represent potential measurement errors in the clinical setting, noise is drawn from a normal distribution with standard deviation $\sigma$ and added to each state before it is given to the agent. Despite large amounts of observation noise, the agent was able to drive population reduction and maintain a similar resistant fraction. Each trace represents the mean over $10$ trajectories, with the standard error represented by the shaded region. Right: The learned policy is general and robust to changes in memory strength. Every $20$ decision steps the memory strength is reset to a new value, drawn from a uniform distribution over the interval $[0.6,1]$. The agent is able to quickly adapt to mitigate population growth in the new environment.
  • ...and 2 more figures
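
The captions above describe a two-state switching model, a resistant fraction-based pulsing baseline, and Gaussian observation noise. The following is a minimal Python sketch of how those pieces could fit together in the memoryless case only; it omits the memory kernel entirely. The rate functions `kappa_s`, `kappa_r`, `delta`, and `alpha`, all numeric constants, the threshold values, and the forward-Euler integrator are illustrative assumptions, not the authors' parameterization or method.

```python
import math
import random

# Hypothetical concentration-dependent rates (u = 0: no drug, u = 1: full dose).
# The source only states that these rates depend on drug concentration.
def kappa_s(u): return 0.05 - 0.15 * u   # susceptible net growth (death under drug)
def kappa_r(u): return 0.01              # resistant net growth (assumed drug-insensitive)
def delta(u):   return 0.01 + 0.10 * u   # S -> R switching rate
def alpha(u):   return 0.10 * (1.0 - u)  # R -> S switching rate

def step(s, r, u, dt):
    """One forward-Euler step of the memoryless two-state model."""
    ds = (kappa_s(u) - delta(u)) * s + alpha(u) * r
    dr = (kappa_r(u) - alpha(u)) * r + delta(u) * s
    return s + dt * ds, r + dt * dr

def make_fraction_policy(lower=0.3, upper=0.7):
    """Bang-bang pulsing: dose until the observed resistant fraction exceeds
    `upper`, then pause until it falls below `lower` (assumed thresholds)."""
    state = {"dosing": True}
    def policy(s_obs, r_obs):
        total = max(s_obs + r_obs, 1e-12)  # guard against noisy non-positive totals
        frac = r_obs / total
        if state["dosing"] and frac > upper:
            state["dosing"] = False
        elif not state["dosing"] and frac < lower:
            state["dosing"] = True
        return 1.0 if state["dosing"] else 0.0
    return policy

def simulate(policy, s0=0.9, r0=0.1, t_end=50.0, dt=0.01,
             decision_dt=1.0, sigma=0.0, seed=0):
    """Return (C, doses), where C = log(N(T) / N(0)) and N = S + R."""
    rng = random.Random(seed)
    s, r = s0, r0
    n0 = s + r
    substeps = round(decision_dt / dt)
    doses = []
    for _ in range(round(t_end / decision_dt)):
        # Gaussian observation noise is added to each state before the policy sees it.
        s_obs = s + rng.gauss(0.0, sigma)
        r_obs = r + rng.gauss(0.0, sigma)
        u = policy(s_obs, r_obs)
        doses.append(u)
        for _ in range(substeps):
            s, r = step(s, r, u, dt)
    return math.log((s + r) / n0), doses

if __name__ == "__main__":
    c_none, _ = simulate(lambda s, r: 0.0)
    c_pulse, _ = simulate(make_fraction_policy())
    c_noisy, _ = simulate(make_fraction_policy(), sigma=0.05, seed=1)
    print(f"no drug: C={c_none:.3f}  pulsing: C={c_pulse:.3f}  noisy pulsing: C={c_noisy:.3f}")
```

Lower values of C mean a smaller final population relative to the initial one. In this sketch the policy only sees the (possibly noisy) observation, while the dynamics evolve on the true state, which mirrors the observation-noise setup described for Figure 5.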