Run-and-tumble chemotaxis using reinforcement learning
Ramesh Pramanik, Shradha Mishra, Sakuntala Chatterjee
TL;DR
We ask how reinforcement learning (RL) can reproduce run-and-tumble chemotaxis in spatial attractant gradients while also learning the environment. The approach uses a one-dimensional RL framework with two actions (Persist, Reverse), a history-based cost that compares $[L](x(t-\Delta_1))$ with $[L](x(t-\Delta_2))$ for $\Delta_1=1$, $\Delta_2=2$, and $Q$-learning governed by $\alpha$ and $\epsilon$. It is applied to sinusoidal and multi-peak attractant profiles. Long-time localization, quantified by the uptake $\langle C\rangle$, and the ability to learn the full landscape both depend nontrivially on $\epsilon$ and $\alpha$: the agent can become trapped when the peaks are unequal, and there are clear optimal regions in the $\epsilon$–$\alpha$ plane. In homogeneous settings the mean run duration is $\tau=\dfrac{2}{\epsilon(1-p_0)}$, whereas structured environments show a nonmonotonic $\tau(\epsilon)$ and optimal search times, measured by mean first-passage times. The work provides a quantitative link between RL strategies and chemotactic navigation, offering insight into bacterial behavior and guidance for the design of RL-guided microrobots operating in gradient fields.
Abstract
Bacterial cells use run-and-tumble motion to climb attractant concentration gradients in their environment. By extending uphill runs and shortening downhill runs, the cells migrate towards zones of higher attractant concentration. Motivated by this, we formulate a reinforcement learning (RL) algorithm in which an agent moves in one dimension in the presence of an attractant gradient. The agent can perform two actions: persistent motion in the same direction, or reversal of direction. We assign costs to these actions based on the recent history of the agent's trajectory. We ask which RL strategy works best in different types of attractant environments. We quantify the efficiency of an RL strategy by the ability of the agent (a) to localize in the favorable zones at long times, and (b) to learn about its complete environment. Depending on the attractant profile and the initial condition, we find that an optimum balance between exploration and exploitation is needed to ensure the most efficient performance.
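To make the setup concrete, the following is a minimal Python sketch of a two-action $Q$-learning agent on a one-dimensional sinusoidal attractant profile. It is not the authors' implementation. Several ingredients are assumptions introduced only for illustration: the ring lattice with periodic boundaries, the specific profile in `attractant`, the choice of state (the sign of the most recent concentration change), the reward of $\pm 1$ for an uphill or downhill step (used here in place of the paper's cost function, in the spirit of comparing the attractant level at two recent positions), the discount factor `gamma`, and the reading of $\alpha$ as the learning rate and $\epsilon$ as the exploration probability, as in standard $Q$-learning. The returned `mean_uptake` is only a rough proxy for a localization measure such as $\langle C\rangle$.

```python
import math
import random

PERSIST, REVERSE = 0, 1  # the two actions named in the text


def attractant(x, length):
    """Assumed sinusoidal profile on a ring of `length` sites (peak at length/4)."""
    return 1.0 + math.sin(2.0 * math.pi * x / length)


def gradient_sign(prev_x, curr_x, length):
    """+1 if the last step went uphill, -1 if downhill, 0 if unchanged."""
    diff = attractant(curr_x, length) - attractant(prev_x, length)
    return (diff > 0) - (diff < 0)


def simulate(steps=20000, length=100, alpha=0.1, epsilon=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Q-table indexed by state (sign of the last concentration change) and action.
    q = {s: [0.0, 0.0] for s in (-1, 0, 1)}
    x = rng.randrange(length)
    direction = rng.choice((-1, 1))
    state = 0  # no history yet
    uptake_sum, uptake_count = 0.0, 0

    for t in range(steps):
        # Epsilon-greedy choice: explore with probability epsilon, else exploit.
        if rng.random() < epsilon:
            action = rng.choice((PERSIST, REVERSE))
        else:
            action = PERSIST if q[state][PERSIST] >= q[state][REVERSE] else REVERSE

        if action == REVERSE:
            direction = -direction
        new_x = (x + direction) % length  # periodic boundary (assumption)

        # History-based reward (assumption): compare the two latest positions.
        reward = gradient_sign(x, new_x, length)
        next_state = reward

        # Standard Q-learning update with learning rate alpha.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])

        x, state = new_x, next_state
        if t >= steps // 2:  # average over the second half of the run only
            uptake_sum += attractant(x, length)
            uptake_count += 1

    return q, uptake_sum / uptake_count


q_table, mean_uptake = simulate()
```

Under these assumptions, varying `epsilon` and `alpha` in `simulate` and comparing `mean_uptake` gives a simple way to explore the exploration-exploitation balance described above. The inspected entries `q_table[-1]` show whether the agent has learned to reverse after a downhill step.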
