Multi-Agent Transfer Learning via Temporal Contrastive Learning

Weihao Zeng, Joseph Campbell, Simon Stepputtis, Katia Sycara

TL;DR

This work tackles sample-inefficient transfer learning in multi-agent reinforcement learning by combining goal-conditioned policies with unsupervised temporal abstraction. The method pre-trains a GCRL agent on a source environment, finetunes it on a target domain, and learns a temporal latent space via contrastive learning to build a planning graph whose nodes are latent clusters and edges are observed transitions. Sub-goals derived from the graph guide execution in the target domain, enabling efficient planning and improved interpretability. In Overcooked experiments, the approach achieves similar or better performance with only about 21.7% of the training data required by baselines, demonstrating strong gains in sample efficiency and the ability to handle sparse rewards and long-horizon tasks.
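
To make the planning-graph idea concrete, here is a minimal sketch of how a graph with latent clusters as nodes and observed transitions as edges could be built and searched for a sub-goal sequence. It assumes that each time step of the demonstration has already been assigned a cluster label. The function names, the toy label sequence, and the use of breadth-first search are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict, deque


def build_planning_graph(cluster_ids):
    """Build a directed graph from a sequence of cluster labels.

    cluster_ids: one (hypothetical) latent-cluster label per time step.
    Nodes are clusters; an edge u -> v is added whenever the trajectory
    moves from cluster u to a different cluster v at consecutive steps.
    """
    graph = defaultdict(set)
    for u, v in zip(cluster_ids, cluster_ids[1:]):
        graph[u]  # touching the key registers u as a node
        if u != v:
            graph[u].add(v)
    if cluster_ids:
        graph[cluster_ids[-1]]  # the final cluster is a node as well
    return {node: sorted(nbrs) for node, nbrs in graph.items()}


def subgoal_sequence(graph, start, goal):
    """Breadth-first search for the shortest chain of clusters from start to goal."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # goal cluster is not reachable from start


# Toy cluster labels for a single demonstration (made up for illustration).
demo_clusters = [0, 0, 1, 1, 2, 1, 2, 3, 3]
graph = build_planning_graph(demo_clusters)
plan = subgoal_sequence(graph, start=0, goal=3)
print(graph)
print(plan)
```

Each cluster on the returned path can then serve as the next sub-goal handed to the goal-conditioned policy. How sub-goals are actually selected and represented in the paper is not specified in this summary.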

Abstract

This paper introduces a novel transfer learning framework for deep multi-agent reinforcement learning. It combines goal-conditioned policies with temporal contrastive learning to automatically discover meaningful sub-goals. Specifically, the approach pre-trains a goal-conditioned agent, finetunes it on the target domain, and uses contrastive learning to construct a planning graph that guides the agent via sub-goals. Experiments on multi-agent coordination tasks in Overcooked demonstrate improved sample efficiency, the ability to solve sparse-reward and long-horizon problems, and enhanced interpretability compared to baselines. The results highlight the effectiveness of integrating goal-conditioned policies with unsupervised temporal abstraction learning for complex multi-agent transfer learning. Compared to state-of-the-art baselines, our method achieves the same or better performance while requiring only 21.7% of the training samples.
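
The abstract does not say which contrastive objective is used. As an illustration only, the sketch below implements the widely used InfoNCE loss in plain Python. Each anchor embedding is paired with the embedding of a temporally nearby state as its positive, and the other positives in the batch act as negatives. The function names, the temperature value, and the toy embeddings are assumptions for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE over a batch of embedding pairs.

    anchors[i] should score highest against positives[i]; every other
    entry of `positives` is treated as a negative for anchor i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings (made up): temporally close states share a direction.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is small when each anchor is most similar to its own positive and large when the pairs are mismatched. Minimizing it therefore pulls temporally adjacent states together in the latent space, which is the property that later clustering relies on.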

Paper Structure (11 sections, 1 equation, 6 figures, 2 tables, 2 algorithms)

Figures (6)

  • Figure 1: Our method follows three steps: 1) pre-train the GCRL agent to acquire diverse transferable skills by achieving short-horizon goals in the source environment; 2) finetune the GCRL agent on the target environment, learn a latent space that captures the temporal structure of trajectories obtained by rolling out the finetuned GCRL agent, and construct a planning graph whose nodes are clusters in the latent space and whose edges are transitions between clusters observed in the expert trajectory; and 3) execute the task in the target environment by following sub-goals. We assume a single successful demonstration in the target environment is given, which we use to guide agent finetuning and graph construction.
  • Figure 2: The source and target Overcooked (Carroll et al., 2019) tasks. The two chefs need to coordinate to make and deliver soups. Each environment contains two chefs (one with a green hat and one with a blue hat), onion dispensers, plate dispensers, ovens (grey boxes with black tops), a serving area (the plain light grey box), walls (brown boxes), and optionally cilantro dispensers.
  • Figure 3: Overcooked recipes. To make one soup, the two chefs need to 1) fetch three onions from the onion dispenser and put them into the oven one by one; 2) turn on the oven and wait for 20 steps; and 3) fetch a plate from the plate dispenser and transfer the soup from the oven to the plate. Optionally, 4) to make a cilantro soup, they fetch cilantro from the dispenser and put it on the soup plate.
  • Figure 4: Overcooked learning curves. Average number of soups delivered over 50 episodes throughout training. Most baselines deliver no soups in the small corridor and corridor environments, so their curves appear as overlapping flat lines.
  • Figure 5: Scatter plot of normalized performance against sample efficiency in the Overcooked environment. Performance is normalized as (maximum number of soups delivered by a given method) / (maximum number of soups delivered by any method in that environment). Sample efficiency is normalized as 1 - (steps to convergence for a given method / maximum steps to convergence in the environment). Steps to convergence is the number of training steps at which a method first reaches 90% of its maximum performance. Variants of the same method are grouped under a single plot category.
  • ...and 1 more figures
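
The two normalizations described in the Figure 5 caption can be written out as a short sketch. The learning curves below are made-up numbers for two hypothetical methods, "A" and "B", and the function names are illustrative. Only the formulas themselves come from the caption.

```python
def normalized_performance(max_soups, method):
    """Max soups for one method divided by the best max across all methods."""
    best = max(max_soups.values())
    return max_soups[method] / best


def steps_to_convergence(curve, threshold=0.9):
    """First training step at which a method reaches 90% of its own maximum.

    curve: list of (training_steps, soups_delivered) pairs sorted by steps.
    """
    target = threshold * max(soups for _, soups in curve)
    for steps, soups in curve:
        if soups >= target:
            return steps


def normalized_sample_efficiency(conv_steps, method):
    """1 - (steps to convergence / slowest steps to convergence)."""
    slowest = max(conv_steps.values())
    return 1 - conv_steps[method] / slowest


# Hypothetical learning curves in a single environment (not from the paper).
curves = {
    "A": [(100, 0), (200, 4), (300, 5), (400, 5)],
    "B": [(100, 0), (200, 1), (300, 2), (400, 4)],
}
max_soups = {m: max(s for _, s in c) for m, c in curves.items()}
conv_steps = {m: steps_to_convergence(c) for m, c in curves.items()}

perf_a = normalized_performance(max_soups, "A")
perf_b = normalized_performance(max_soups, "B")
eff_a = normalized_sample_efficiency(conv_steps, "A")
eff_b = normalized_sample_efficiency(conv_steps, "B")
print(perf_a, perf_b, eff_a, eff_b)
```

Under this definition the slowest method in an environment always receives a sample efficiency of 0, and a method that never delivers a soup trivially "converges" at its first logged step. How the paper treats that second case is not stated in the caption.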