Laboratory Experiments of Model-based Reinforcement Learning for Adaptive Optics Control

Jalo Nousiainen, Byron Engler, Markus Kasper, Chang Rajani, Tapio Helin, Cédric T. Heritier, Sascha P. Quanz, Adrian M. Glauser

TL;DR

This work demonstrates that a model-based reinforcement-learning controller, PO4AO, can robustly control a second-stage adaptive optics loop in a laboratory setting, addressing photon noise, temporal delay, and misregistration. By learning a nonlinear policy and a CNN-based dynamics model and running training in parallel with inference, PO4AO achieves substantial improvements in wavefront residuals and PSF contrast over a classical integrator across varying delays, flux levels, and disturbances. The study provides detailed hyperparameters, latency analyses, and an open-source Python implementation, enabling adaptation to other AO systems and real-time pipelines. Overall, the results indicate that PO4AO offers a turnkey, data-driven approach to predictive AO control with practical implications for high-contrast exoplanet imaging, while highlighting avenues to further optimize latency and scalability.

Abstract

Direct imaging of Earth-like exoplanets is one of the most prominent scientific drivers of the next generation of ground-based telescopes. Typically, Earth-like exoplanets are located at small angular separations from their host stars, making their detection difficult. Consequently, the adaptive optics (AO) system's control algorithm must be carefully designed to distinguish the exoplanet from the residual light produced by the host star. A promising new avenue of research for improving AO control builds on data-driven control methods such as reinforcement learning (RL). RL is an active branch of machine learning research in which control of a system is learned through interaction with the environment. Thus, RL can be seen as an automated approach to AO control whose use is entirely a turnkey operation. In particular, model-based reinforcement learning (MBRL) has been shown to cope with both temporal and misregistration errors. Similarly, it has been demonstrated to adapt to non-linear wavefront sensing while being efficient in training and execution. In this work, we implement and adapt an RL method called Policy Optimization for AO (PO4AO) to the GHOST test bench at ESO headquarters, where we demonstrate strong performance of the method in a laboratory environment. Our implementation allows the training to be performed in parallel with inference, which is crucial for on-sky operation. In particular, we study the predictive and self-calibrating aspects of the method. The new implementation on GHOST running PyTorch adds only around 700 microseconds on top of the hardware, pipeline, and Python interface latency. We open-source well-documented code for the implementation and specify the requirements for the RTC pipeline. We also discuss the important hyperparameters of the method, the source of the latency, and the possible paths for a lower latency implementation.
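The idea of training in parallel with inference can be illustrated with a toy sketch. This is not the PO4AO implementation: the scalar "system", the one-parameter dynamics fit, and all names (`SharedPolicy`, `control_loop`, `trainer`, `run`) are assumptions made for illustration. The control thread only ever reads the latest policy parameter and never waits for training, while a second thread fits a model on finished episodes and publishes an updated parameter.

```python
import queue
import threading


class SharedPolicy:
    """Thread-safe holder for the policy parameter (here a single gain)."""

    def __init__(self, gain):
        self._lock = threading.Lock()
        self._gain = gain

    def get(self):
        with self._lock:
            return self._gain

    def set(self, gain):
        with self._lock:
            self._gain = gain


def control_loop(policy, episodes, n_episodes, n_frames, offset, residuals):
    """Inference side: never blocks on training, only reads the latest gain."""
    command = 0.0
    for _ in range(n_episodes):
        episode = []
        for _ in range(n_frames):
            obs = offset - command           # toy residual measurement
            action = policy.get() * obs      # toy policy: proportional step
            command += action
            next_obs = offset - command
            episode.append((obs, action, next_obs))
            residuals.append(next_obs)
        episodes.put(episode)                # hand the episode to the trainer
    episodes.put(None)                       # sentinel: no more data


def trainer(policy, episodes, stats):
    """Training side: fit next_obs = obs + b * action, then update the gain."""
    while True:
        episode = episodes.get()
        if episode is None:
            break
        num = sum(a * (o2 - o1) for o1, a, o2 in episode)
        den = sum(a * a for _, a, _ in episode)
        if den > 1e-12:                      # skip episodes with no excitation
            b = num / den                    # learned one-parameter "dynamics"
            policy.set(-1.0 / b)             # gain that cancels the residual
        stats["episodes_trained"] += 1


def run(n_episodes=3, n_frames=50, offset=1.0, initial_gain=0.3):
    policy = SharedPolicy(initial_gain)
    episodes = queue.Queue()
    residuals, stats = [], {"episodes_trained": 0}
    threads = [
        threading.Thread(
            target=control_loop,
            args=(policy, episodes, n_episodes, n_frames, offset, residuals),
        ),
        threading.Thread(target=trainer, args=(policy, episodes, stats)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return residuals, stats, policy.get()
```

Calling `run()` returns the per-frame residuals, the number of episodes the trainer consumed, and the final gain. The point of the design is that the control loop's per-frame cost does not depend on how long a training step takes; the exact frame at which a new parameter takes effect depends on thread timing.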

Paper Structure (23 sections, 10 equations, 11 figures, 4 tables, 3 algorithms)

Figures (11)

  • Figure 1: Neural network architectures. Both NNs, the dynamics model and the policy, take as input concatenated tensors of past actions and observations. They also share the same fully convolutional structure in the first three layers. At the output layer, the policy model includes the KL-filtering scheme (upper right corner), and the dynamics model output is multiplied by the WFS mask (lower right corner). For GHOST, the input and output images are 24x24 pixels (set by the DM).
  • Figure 2: GHOST coronagraphic PSFs. Left: the PSF without any turbulence and with the DM set flat. Right: the PSF with simulated first-stage system residual phase screens played on the SLM and a flat DM. The speckle at around 1 o'clock is a ghost in the system.
  • Figure 3: PO4AO interface for the RTC pipeline. The COSMIC pipeline preprocesses the raw WFS data, projects it to DM space with the command matrix (using the modal basis matrices S2M and V2M), writes the "delta volts" to the shared memory buffer, and suspends the loop. The Python interface (the green box) reads the shared memory buffer and passes the data to the PO4AO implementation. PO4AO calculates the next command and saves the data (orange boxes), and the Python interface writes the command to shared memory, where COSMIC registers the command and passes it to the saturation management algorithm (SMA) / clipping stage.
  • Figure 4: Learning curves for time delay experiments. The red lines correspond to the performance of PO4AO during each episode, and the blue lines are for the integrator. A single episode is 500 frames. The gray dashed line marks the end of the integrator warm-up for PO4AO. In all cases, PO4AO outperforms the integrator already after the warm-up period. The training is done in parallel with control, so the 10 episodes correspond to approximately 14 s in the figure.
  • Figure 5: PSFs for different additional control delays. The top row is for integrator control, and the bottom row is for PO4AO. PO4AO and its hyperparameters are exactly the same for all time delays; the time delay is learned from the interaction.
  • ...and 6 more figures
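
The per-frame handshake described in the Figure 3 caption can be sketched as below. This is a minimal single-process simulation under stated assumptions: a plain dictionary stands in for the shared memory buffer, a simple integrator stands in for the PO4AO controller, and every name (`rtc_preprocess`, `IntegratorController`, `run_frame`, the `limit` parameter) is hypothetical rather than part of COSMIC or the released code.

```python
def rtc_preprocess(raw_wfs, command_matrix):
    """RTC side: project WFS measurements to DM space ("delta volts")."""
    return [sum(c * s for c, s in zip(row, raw_wfs)) for row in command_matrix]


def clip(command, limit):
    """RTC side: stand-in for the saturation management / clipping stage."""
    return [max(-limit, min(limit, v)) for v in command]


class IntegratorController:
    """Placeholder for the controller called by the Python interface."""

    def __init__(self, n_actuators, gain):
        self.gain = gain
        self.command = [0.0] * n_actuators

    def step(self, delta_volts):
        # Accumulate the correction; a learned policy would go here instead.
        self.command = [
            c + self.gain * d for c, d in zip(self.command, delta_volts)
        ]
        return list(self.command)


def run_frame(raw_wfs, command_matrix, shm, controller, limit):
    """One loop iteration following the order of steps in the caption."""
    # 1. RTC preprocesses, writes delta volts to the buffer, then waits.
    shm["delta_volts"] = rtc_preprocess(raw_wfs, command_matrix)
    # 2. Python interface reads the buffer and calls the controller.
    delta_volts = shm["delta_volts"]
    # 3. Python interface writes the new command back to the buffer.
    shm["command"] = controller.step(delta_volts)
    # 4. RTC registers the command and applies clipping before the DM.
    return clip(shm["command"], limit)


identity = [[1.0, 0.0], [0.0, 1.0]]   # toy 2-actuator command matrix
shm = {}                              # dictionary standing in for shared memory
ctrl = IntegratorController(n_actuators=2, gain=0.5)
first = run_frame([1.0, -4.0], identity, shm, ctrl, limit=1.0)
second = run_frame([1.0, -4.0], identity, shm, ctrl, limit=1.0)
```

In this sketch the clipping happens after the controller, so the controller's internal command is not itself clipped; whether the clipped command is fed back to the controller in a real pipeline is not specified by the caption.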