Online Learning of Temporal Dependencies for Sustainable Foraging Problem

John Payne, Aishwaryaprajna, Peter R. Lewis

TL;DR

This work investigates learning in a one-shot, dynamic foraging environment that embodies a social dilemma. It compares online neuro-evolution (NE) and Deep Recurrent Q-Networks (DRQNs) as deliberative architectures and augments both with Long Short-Term Memory (LSTM) to capture temporal dependencies. Results show that, while online learning enables one-shot adaptation, both online NE and DRQN tend to converge on greedy policies that deplete resources in multi-agent settings; LSTM helps single agents develop sustainable actions but is insufficient to overcome the Tragedy of the Commons in groups. The findings highlight the limits of temporal awareness alone for cooperative long-horizon goals and motivate future work on higher-level reflective mechanisms or meta-learning to balance short-term rewards with sustainability.

Abstract

The sustainable foraging problem is a dynamic environment testbed for exploring forms of agent cognition for dealing with social dilemmas in a multi-agent setting. The agents need to resist the temptation of individual rewards from foraging and instead choose the collective long-term goal of sustainability. We investigate methods of online learning in Neuro-Evolution and Deep Recurrent Q-Networks that enable agents to attempt the problem in one shot, as wicked social problems often require. We further explore whether learning temporal dependencies with Long Short-Term Memory can help the agents develop sustainable foraging strategies in the long term. We found that integrating Long Short-Term Memory helped a single agent develop sustainable strategies, but did not help agents manage the social dilemma that arises in the multi-agent scenario.
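To make the dilemma concrete, the following is a minimal toy sketch of a shared, regrowing resource harvested by agents that all play either a greedy or a moderate action. It is not the paper's environment: the `Commons` class, the logistic regrowth rule, and the values of `growth_rate`, `GREEDY`, and `MODERATE` are hypothetical choices made for illustration. Only the agent count (10) and episode length (1000 time steps) are taken from the figure captions.

```python
from dataclasses import dataclass


@dataclass
class Commons:
    """Toy shared resource with logistic regrowth (all parameters hypothetical)."""
    stock: float = 100.0
    capacity: float = 100.0
    growth_rate: float = 0.3

    def step(self, requests):
        """Serve the harvest requests, then regrow; return what each agent received."""
        total = sum(requests)
        # If demand exceeds the stock, every request is scaled down proportionally.
        scale = min(1.0, self.stock / total) if total > 0 else 0.0
        harvests = [r * scale for r in requests]
        self.stock = max(self.stock - sum(harvests), 0.0)
        # Logistic regrowth: an exhausted stock cannot recover.
        self.stock += self.growth_rate * self.stock * (1 - self.stock / self.capacity)
        return harvests


# Hypothetical per-step harvest sizes for the two actions.
GREEDY, MODERATE = 5.0, 0.5


def simulate(n_agents, action, steps=1000):
    """Run a fixed-policy population; return (final stock, total collective reward)."""
    env = Commons()
    total_reward = 0.0
    for _ in range(steps):
        total_reward += sum(env.step([action] * n_agents))
    return env.stock, total_reward


if __name__ == "__main__":
    for name, action in (("greedy", GREEDY), ("moderate", MODERATE)):
        stock, reward = simulate(10, action)
        print(f"{name:>8}: final stock = {stock:.2f}, collective reward = {reward:.1f}")
```

Under these assumed parameters, ten greedy agents exhaust the resource within a few steps and earn little afterwards, while ten moderate agents keep the stock high and collect far more over the full episode. That gap between short-term individual gain and long-term collective return is the tension the learning agents in the paper must resolve.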

Paper Structure (5 sections, 13 figures)

Figures (13)

  • Figure 1: Mean simulation results across 30 independent runs showing baseline behaviour of moderate and greedy agents.
  • Figure 2: Mean simulation results for a single online neuro-evolution agent with a choice of greedy or moderate actions, averaged over 30 independent runs of 1000 time steps each.
  • Figure 3: Mean simulation results for 10 online neuro-evolution agents with a choice of greedy or moderate actions, averaged over 30 independent runs of 1000 time steps each.
  • Figure 4: Mean simulation results for a single DRQN agent with a choice of greedy or moderate actions, averaged over 30 independent runs of 1000 time steps each.
  • Figure 5: Mean simulation results for 10 DRQN agents with a choice of greedy or moderate actions, averaged over 30 independent runs of 1000 time steps each.
  • ...and 8 more figures