Q-learning with temporal memory to navigate turbulence

Marco Rando; Martin James; Alessandro Verri; Lorenzo Rosasco; Agnese Seminara

TL;DR

This work tackles the problem of locating an odor source in a turbulent plume using only odor cues, with no spatial perception or prior information. It introduces a map-free reinforcement learning approach that uses a small, interpretable olfactory state space derived from a temporal memory window over odor traces, discretizing intermittency $\bar{i}$ and intensity $\bar{c}$ to guide actions via tabular Q-learning. A key finding is the existence of an optimal memory length $T^*$ that balances ignoring plume blanks against triggering recovery outside the plume; a learned recovery policy and adaptive memory $T=\tau_b^-$ yield cast-and-surge-like behavior and robust generalization across multiple environments. The results suggest that temporal features of odor signals can robustly drive navigation in highly intermittent turbulence, with implications for understanding insect behavior and for designing odor-guided autonomous systems.
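The two features named above can be illustrated with a minimal sketch. It assumes that intermittency $\bar{i}$ is the fraction of samples in the memory window above a detection threshold and that intensity $\bar{c}$ is the mean concentration over the detected samples; the summary names the features but does not spell out these definitions. The function name `olfactory_features` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def olfactory_features(trace, T, threshold=0.0):
    """Summarize the last T samples of an odor trace (assumes T >= 1).

    Returns (intermittency, intensity), or None when nothing was
    detected within the window (the "void" state).
    Definitions are assumptions made for illustration.
    """
    window = np.asarray(trace[-T:], dtype=float)
    detected = window > threshold
    if not detected.any():
        return None  # no odor within memory: void state
    intermittency = float(detected.mean())      # fraction of detections
    intensity = float(window[detected].mean())  # mean over detections only
    return intermittency, intensity
```

With a short memory, a blank inside the plume empties the window and returns the void state; with a longer memory the same blank is bridged by earlier detections, which is the trade-off behind the optimal memory $T^*$.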

Abstract

We consider the problem of olfactory searches in a turbulent environment. We focus on agents that respond solely to odor stimuli, with no access to spatial perception or prior information about the odor. We ask whether navigation to a target can be learned robustly within a sequential decision-making framework. We develop a reinforcement learning algorithm using a small set of interpretable olfactory states and train it with realistic turbulent odor cues. By introducing a temporal memory, we demonstrate that two salient features of odor traces, discretized into a few olfactory states, are sufficient to learn navigation in a realistic odor plume. Performance is dictated by the sparse nature of turbulent odors. An optimal memory exists which ignores blanks within the plume and activates a recovery strategy outside the plume. We obtain the best performance by letting agents learn their recovery strategy and show that it is mostly casting crosswind, similar to behavior observed in flying insects. The optimal strategy is robust to substantial changes in the odor plumes, suggesting minor parameter tuning may be sufficient to adapt to different environments.
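For readers unfamiliar with tabular Q-learning, the sketch below shows the standard update rule applied to a table indexed by olfactory state and action. It is a generic textbook form, not the authors' implementation; the learning rate `alpha`, discount `gamma`, exploration rate `epsilon`, and the table size of 16 states by 4 actions used in the usage note are assumptions.

```python
import numpy as np

def q_update(Q, o, a, r, o_next, alpha=0.1, gamma=0.99, terminal=False):
    """One tabular Q-learning step:
    Q[o, a] += alpha * (r + gamma * max_a' Q[o_next, a'] - Q[o, a])."""
    target = r if terminal else r + gamma * Q[o_next].max()
    Q[o, a] += alpha * (target - Q[o, a])
    return Q

def epsilon_greedy(Q, o, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[o]))
```

A table would be initialized as, for example, `Q = np.zeros((16, 4))` and a generator as `rng = np.random.default_rng(0)`; the agent then alternates `epsilon_greedy` and `q_update` along each training episode.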

Paper Structure (5 sections, 22 equations, 5 figures, 3 tables)

Introduction
Results
Discussion
Methods and Materials
Acknowledgements

Figures (5)

Figure 3: Learning a stimulus-response strategy for turbulent navigation. (A) Representation of the search problem with turbulent odor cues obtained from Direct Numerical Simulations of fluid turbulence (grey scale: odor snapshot from the simulations). The discrete position $s$ is hidden; the odor concentration $z_T=\{z(s(t'),t') \mid t-T\le t'\le t\}$ is observed along the trajectory $s(t')$, where $T$ is the sensing memory. (B) Odor traces from direct numerical simulations at different (fixed) points within the plume. Odor is noisy and sparse; information about the source is hidden in the temporal dynamics. (C) Contour maps of olfactory states with nearly infinite memory ($T = 2598$): on average, olfactory states map to different locations within the plume, and the void state is outside the plume. Intermittency is discretized into three bins defined by two thresholds, 66% (red line) and 33% (blue line). Intensity is discretized into five bins (dark red shade to white shade) defined by four thresholds (percentiles 99%, 80%, 50%, 25%). (D) Performance of stimulus-response strategies obtained during training, averaged over 500 episodes. We train using realistic turbulent data with memory $T = 20$ and backtracking recovery.
Figure 4: The optimal memory $T^*$. (A) Four measures of performance as a function of memory with backtracking recovery (solid line) show that the optimal memory $T^*=20$ maximizes average performance and minimizes standard deviation, except for the normalized time. Top: averages computed over $10$ realizations of test trajectories starting from $43000$ initial positions (dashed line: results with adaptive memory). Bottom: standard deviation of the mean performance metrics for each initial condition (see Materials and Methods). (B) Average number of times agents encounter the void state along their path, $\langle N_\emptyset \rangle$, as a function of memory (top); cumulative average reward $\langle G\rangle$ is inversely correlated with $\langle N_\emptyset \rangle$ (bottom), hence the optimal memory minimizes encounters with the void. (C) Colormaps: probability that agents at different spatial locations are in the void state at any point in time, starting the search from anywhere in the plume, and representative trajectory of a successful searcher (green solid line) with memory $T = 1$, $T = 20$, $T = 50$ (left to right). At the optimal memory, agents in the void state are concentrated near the edge of the plume. Agents with shorter memories encounter voids throughout the plume; agents with longer memories encounter more voids outside of the plume as they delay recovery. In all panels, shades are $\pm$ standard deviation.
Figure 5: The adaptive memory approximates the duration of the blank dictated by physics and is an efficient heuristic, especially when coupled with a learned recovery strategy. (A) Top to bottom: colormaps of the Eulerian average blank time $\tau_b$; average sensing memory $T$; standard deviation of Eulerian blank time and of sensing memory. The sensing memory statistics are computed over all agents that are located at each discrete cell, at any point in time. (B) Probability distribution of $\tau_b$ across all spatial locations and times (black) and of $T$ across all agents at all times (gray). (C) Performance with the adaptive memory approaches the performance of the optimal fixed memory, here shown for backtracking; similar results apply to the Brownian recovery (Figure 3--figure supplement 2). (D) Comparison of five recovery strategies with adaptive memory: the learned recovery with adaptive memory outperforms all fixed and adaptive memory agents. In (C) and (D), dark squares mark the mean, and light rectangles mark $\pm$ standard deviation. $f^+$ is defined as the fraction of agents that reach the target at test, hence it has no standard deviation.
Figure 6: Optimal policies with adaptive memory for different recovery strategies: backtracking (green), Brownian (red) and learned (blue). For each recovery, we show the spatial distribution of the olfactory states (top), the policy (center) and the state occupancy (bottom) for non-void states (left) vs the void state $\pi^*(a|\emptyset)$ (right). Spatial distribution: probability that an agent at a given position is in any non-void olfactory state (left) or in the void state (right), color-coded from yellow to blue. Policy: actions learned in the non-void states $\sum_{o\ne\emptyset}n_o\pi^*(a|o)$, weighted by their occupancy $n_o$ (left, arrows proportional to the frequency of the corresponding action) and schematic view of the recovery policy in the void state (right). State occupancy: fraction of agents that are in any of the 15 non-void states (left) or in the void state (right) at any point in space and time. Occupancy is proportional to the radius of the corresponding circle. The position of the circle identifies the olfactory state (rows and columns indicate the discrete intensity and intermittency respectively). All statistics are computed over 43000 trajectories, starting from any location within the plume.
Figure 7: Generalization to statistically different environments. (A) Snapshots of odor concentration normalized with concentration at the source, color-coded from blue (0) to yellow (1) for environments 1 to 6 as labeled. Environment $1^*$ is the native environment where all agents are trained. (B) Performance for the five recovery strategies Brownian (red), backtracking (green), learned (blue), circling (orange) and zigzag (purple) with adaptive memory, trained on the native environment and tested across all environments 1 to 6. Four measures of performance defined in the main text are shown. Dark squares mark the mean, and empty rectangles mark $\pm$ standard deviation. No standard deviation is shown for the $f^+$ measure for the learned, circling and zigzag recoveries as these strategies are deterministic (see Materials and Methods). Figure supplement 10: Generalization for a reduced model with a single non-empty olfactory state. The learned recovery with adaptive memory and a single non-empty olfactory state (empty circles) displays degraded performance with respect to the full model (full circles).
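The discretization described in the Figure 3 caption (three intermittency bins with thresholds at 33% and 66%, five intensity bins with thresholds at the 25th, 50th, 80th and 99th percentiles, plus one void state) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the state indexing, the helper names, and the choice to compute the percentiles from a reference sample of intensities are assumptions.

```python
import numpy as np

INTERMITTENCY_EDGES = np.array([0.33, 0.66])  # two thresholds, three bins
INTENSITY_PERCENTILES = [25, 50, 80, 99]      # four thresholds, five bins
VOID = 15                                     # index of the void state (assumed)

def intensity_edges(reference_intensities):
    """Intensity thresholds as percentiles of a reference sample (assumption)."""
    return np.percentile(reference_intensities, INTENSITY_PERCENTILES)

def olfactory_state(features, edges):
    """Map (intermittency, intensity), or None for void, to an index 0..15."""
    if features is None:
        return VOID
    intermittency, intensity = features
    i_bin = int(np.digitize(intermittency, INTERMITTENCY_EDGES))  # 0..2
    c_bin = int(np.digitize(intensity, edges))                    # 0..4
    return c_bin * 3 + i_bin  # 15 non-void states, rows are intensity bins
```

This gives the 15 non-void states plus the void state mentioned in the Figure 6 caption, so a Q-table over these states has 16 rows.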
