Pretty darn good control: when are approximate solutions better than approximate models

Felipe Montealegre-Mora, Marcus Lapeyrolerie, Melissa Chapman, Abigail G. Keller, Carl Boettiger

TL;DR

The paper investigates when an optimal solution for a simplified, stylized fishery model can outperform an approximate solution for a more realistic, complex model. It demonstrates that deep reinforcement learning (DRL) can learn policy functions directly from interactions with four progressively complex multi-species fisheries models, achieving higher long-term rewards and fewer near-extinction events than classical constant-mortality or constant-escapement policies, especially in the most complex scenarios. A key finding is that DRL not only matches the interpretable form of escapement-like controls but also tailors responses to the broader state of the ecosystem, yielding robust performance under parameter uncertainty. This work suggests that expressive, model-free control can provide practically useful, data-driven policies for ecological management where traditional methods struggle with complexity and uncertainty.

Abstract

Existing methods for optimal control struggle to deal with the complexity commonly encountered in real-world systems, including dimensionality, process error, model bias and data heterogeneity. Instead of tackling these system complexities directly, researchers have typically sought to simplify models to fit optimal control methods. But when is the optimal solution to an approximate, stylized model better than an approximate solution to a more accurate model? While this question has largely gone unanswered owing to the difficulty of finding even approximate solutions for complex models, recent algorithmic and computational advances in deep reinforcement learning (DRL) might finally allow us to address these questions. DRL methods have to date been applied primarily in the context of games or robotic mechanics, which operate under precisely known rules. Here, we demonstrate the ability for DRL algorithms using deep neural networks to successfully approximate solutions (the "policy function" or control rule) in a non-linear three-variable model for a fishery without knowing or ever attempting to infer a model for the process itself. We find that the reinforcement learning agent discovers an effective simplification of the problem to obtain an interpretable control rule. We show that the policy obtained with DRL is both more profitable and more sustainable than any constant mortality policy -- the standard family of policies considered in fishery management.

Paper Structure

This paper contains 18 sections, 16 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: An experimental-design style visualization of the management scenarios considered in this paper. On the x-axis are four different fishery management problems (Table 1). We represent Model 4’s non-stationarity with a clock next to the X variable; this model is intended as an example of a possible simplified model for the effects of climate change. On the y-axis are the different management strategies with which one may control each of the models. At the bottom is the constant escapement strategy (CEsc), based on calling off all fishing below a certain threshold population value. Above that is the constant mortality strategy (CMort), where one optimizes over constant fishing effort strategies. Finally, at the top are DRL-based strategies, where policies are in general functions of the full state of the system and are parametrized by a neural network. The specific DRL-based strategy is referred to as PPO+GP in the main text, after the algorithm used to produce the policy. The results plotted are the average reward obtained by the strategy over 100 episodes, and the fraction of those episodes which do not end with a near-extinction event (denoted Perc for Percentage). We have normalized to the highest reward in each column in order to ease comparison between strategies. For illustrative purposes we have color-coded the results using a two-dimensional color legend displayed at the bottom left.
  • Figure 2: The fixed point diagram for the unharvested dynamics of Model 1 as a function of varying the parameter $\beta H$, assuming zero noise. Stable fixed points (also known as attractors) are plotted using a solid line, while the unstable fixed point is shown as a dotted line.
  • Figure 3: Visualization of the constant escapement strategy tuning procedure for Model 4. There was a certain multiplicity in this tuning strategy: a "ridge of optimality" where policies had essentially equivalent behavior. Throughout our investigation, we tuned constant escapement on several occasions and, on each occasion, a different optimal policy along the ridge was found. The results for different policies along the ridge were in practice equivalent, with no discernible difference in performance. We highlighted the ridge with a white dotted line.
  • Figure 4: Reward distributions for the four strategies considered. These are based on 100 evaluation episodes. We write CEsc for constant escapement, CMort for constant mortality, PPO for the output policy of the PPO optimization algorithm, and PPO+GP for the Gaussian process interpolation of the PPO policy.
  • Figure 5: Histograms of episode lengths and rewards for the four different management strategies considered. Only the first 50 evaluation episodes (from a total of 100) were included, for ease of visualization. From left to right, the four management strategies compared are CEsc, CMort, PPO, and PPO+GP.
  • ...and 9 more figures
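
The two classical strategies named in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single-population state, and the numeric values are assumptions chosen for illustration. Constant escapement is written in its textbook form (harvest only the surplus above a threshold, so fishing is called off below it), and constant mortality as a fixed harvested fraction of the current population.

```python
def constant_mortality_harvest(population: float, mortality: float) -> float:
    """CMort (illustrative): harvest a fixed fraction of the current population."""
    if not 0.0 <= mortality <= 1.0:
        raise ValueError("mortality must lie in [0, 1]")
    return mortality * population


def constant_escapement_harvest(population: float, escapement: float) -> float:
    """CEsc (illustrative): harvest only the surplus above the escapement threshold.

    Below the threshold the harvest is zero, i.e. all fishing is called off.
    """
    return max(population - escapement, 0.0)


if __name__ == "__main__":
    # Hypothetical population levels, in arbitrary units.
    for population in (0.3, 0.5, 0.8):
        cmort = constant_mortality_harvest(population, mortality=0.25)
        cesc = constant_escapement_harvest(population, escapement=0.5)
        print(f"population={population:.1f}  CMort={cmort:.3f}  CEsc={cesc:.3f}")
```

In this sketch each rule has a single tunable parameter (`mortality` or `escapement`) and looks at one population only, whereas the DRL-based policies described in the caption are functions of the full system state.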