Continuously evolving rewards in an open-ended environment

Richard M. Bailey

TL;DR

The paper tackles how agents in open-ended environments can adapt their goals when rewards are not externally fixed. It introduces RULE (Reward Updating through Learning and Expectation), a mechanism for endogenously updating reward coefficients across generations during continuous RL, tested in a simplified ecosystem with Ents and primary producers. Results show populations can abandon detrimental learned behaviours, adjust to novel items like vitamins, and sustain survival under shifting conditions; the approach also reveals interactions with evolution and highlights limitations such as potential reward-hacking risks and the challenge of dormant or expanding reward components. Overall, RULE demonstrates a plausible pathway for endowing agents with dynamic, environment-responsive objectives, with implications for ecological, economic, and multi-agent systems.

Abstract

Unambiguous identification of the rewards driving behaviours of entities operating in complex open-ended real-world environments is difficult, partly because goals and associated behaviours emerge endogenously and are dynamically updated as environments change. Reproducing such dynamics in models would be useful in many domains, particularly where fixed reward functions limit the adaptive capabilities of agents. The simulation experiments described here assess a candidate algorithm for the dynamic updating of rewards, RULE: Reward Updating through Learning and Expectation. The approach is tested in a simplified ecosystem-like setting where experiments challenge entities' survival, calling for significant behavioural change. The population of entities successfully demonstrates the abandonment of an initially rewarded but ultimately detrimental behaviour, amplification of beneficial behaviour, and appropriate responses to novel items added to their environment. These adjustments happen through endogenous modification of the entities' underlying reward function, during continuous learning, without external intervention.

Paper Structure (54 sections, 33 equations, 8 figures, 7 tables, 1 algorithm)

Introduction and context
This contribution
Scope and Problem statement
Problem statement
Simulated open-world environment
Simulated physical environment
Primary producers (PP)
Ents
Energy
Currencies
Life-cycle, reproduction and inheritance
Reinforcement Learning (RL)
Observations
Actions
Rewards and Reward updating
...and 39 more sections

Figures (8)

Figure 1: An image of the model running, showing the environment components.
Figure 2: (a) Black symbols indicate the total reward obtained at each step during the initial policy training. Data in light grey are the birth number of each Ent born during the training period, plotted against the time since its population was initiated (which occurs each time the population collapses). Near-vertical runs of data indicate bursts of reproduction, leading to populations which ultimately collapse; more diagonal runs indicate a more consistent birth rate. (b) The first 3,100 births in greater detail, marking the points in training where multiples of the maximum lifespan (200 s) are observed, which reflect increasingly successful Ent survival and reproduction. Below "2nd", no successful offspring are created; above "2nd" are second-generation Ents; "3rd" and "4th" are not strictly generation numbers, but indicate increasing population persistence, as Ents are born at times that can only be reached after multiples of the maximum lifespan.
Figure 3: Expected-rewards distributions $E_i(\tau)$ are shown here for Currency, Consumption and Reproduction reward coefficients. Black squares indicate the mean values (at each time point) from the last 500 Ents born in the 10,000 s simulation. These values are the 'Baseline reward expectations', following the initial policy training based on the chosen reward function coefficients (red circles in the figures relate to the 'coins challenge' and are discussed below).
Figure 4: Results from the 'Coins sensitivity' experiment. (a) Final population after 10,000 s at the constant coin rate shown in the horizontal axis. The inset shows the probability of population collapse. (b) Example Ent population time series for a range of coin rates.
Figure 5: Results from Experiment 2: the 'coins challenge'. Grey lines in parts (a)-(h) are data from control conditions in which the coin rate was kept constant at 10 coins/cycle. The dashed black lines indicate the availability of new coins. In part (i) blue symbols indicate coin reward coefficients $\theta_4$ for each Ent offspring, plotted against the time of its birth. Further details are available in the main text.
...and 3 more figures
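
The Figure 3 caption describes the baseline reward expectations as per-time-point means over the last 500 Ents born. A minimal sketch of that averaging step is given below, under stated assumptions: the function name `baseline_expectation`, the array layout (one row per Ent in birth order, one column per time point) and the use of NaN for time points an Ent never reached are all hypothetical and not taken from the paper.

```python
import numpy as np


def baseline_expectation(reward_traces, last_n=500):
    """Mean reward at each time point over the most recently born Ents.

    reward_traces: 2-D array-like, shape (n_ents, n_time_points), rows in
    birth order (assumed layout). NaN marks time points an Ent did not
    reach and is ignored in the mean (an assumption of this sketch).
    """
    traces = np.asarray(reward_traces, dtype=float)
    recent = traces[-last_n:]           # keep only the last `last_n` Ents
    return np.nanmean(recent, axis=0)   # average over Ents, per time point


# Toy usage with three Ents and two time points, averaging the last two.
toy = [[1.0, 2.0],
       [3.0, np.nan],
       [5.0, 6.0]]
expected = baseline_expectation(toy, last_n=2)
```

One such array would be computed per reward component (for example Currency, Consumption and Reproduction) to obtain a curve comparable to $E_i(\tau)$ in Figure 3.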
