Trajectory-Class-Aware Multi-Agent Reinforcement Learning

Hyungho Na, Kwanghyeon Lee, Sumin Lee, Il-Chul Moon

TL;DR

TRAMA tackles the generalization gap in multi-task MARL by enabling agents to infer and exploit trajectory-type information during execution. It constructs a discretized trajectory embedding space via a modified VQ-VAE with a trajectory-class-aware coverage loss, clusters trajectories to identify classes, and learns a trajectory-class-aware policy that conditions actions on the predicted class. A trajectory-class predictor operates on partial observations, while a trajectory-class representation model provides class-conditioned features to the action policy, enabling task-aware decisions across diverse tasks. Empirical results on SMACv2-based multi-task settings and standard benchmarks show that TRAMA improves learning efficiency and task performance, including in out-of-distribution scenarios, by leveraging unsupervised trajectory clustering and trajectory-class conditioning.

Abstract

In the context of multi-agent reinforcement learning, generalization is the challenge of solving various tasks that may require different joint policies or coordination, without relying on policies specialized for each task. We refer to this type of problem as a multi-task problem, and we train agents to be versatile in this multi-task setting through a single training process. To address this challenge, we introduce TRajectory-class-Aware Multi-Agent reinforcement learning (TRAMA). In TRAMA, agents recognize a task type by identifying, from their partial observations, the class of trajectories they are experiencing, and they use this trajectory-class prediction as additional input to the action policy. To this end, we introduce three primary objectives in TRAMA: (a) constructing a quantized latent space to generate trajectory embeddings that reflect key similarities among them; (b) conducting trajectory clustering using these trajectory embeddings; and (c) building a trajectory-class-aware policy. Specifically for (c), we introduce a trajectory-class predictor that performs agent-wise predictions of the trajectory class, and we design a trajectory-class representation model for each trajectory class. Each agent takes actions based on this trajectory-class representation along with its partial observation for task-aware execution. The proposed method is evaluated on various tasks, including multi-task problems built upon StarCraft II. Empirical results show performance improvements over state-of-the-art baselines.
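The three objectives (a) to (c) can be illustrated with a small NumPy sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the codebook is fixed rather than learned by a VQ-VAE, the trajectory embedding is assumed to be the time-average of the quantized vectors, plain k-means with farthest-point initialization stands in for the clustering step, and the class representation is a simple lookup table. All function names (`quantize`, `trajectory_embedding`, `kmeans`, `policy_input`) are hypothetical.

```python
import numpy as np


def quantize(latents, codebook):
    """(a) Vector quantization: map each latent (T, d) to its nearest codebook row (K, d)."""
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]


def trajectory_embedding(latents, codebook):
    """Assumed pooling: average the quantized vectors over the trajectory's timesteps."""
    _, quantized = quantize(latents, codebook)
    return quantized.mean(axis=0)


def assign(x, centers):
    """Index of the nearest center for every row of x."""
    return ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)


def kmeans(x, k, iters=20):
    """(b) Plain k-means (a stand-in clustering method) with farthest-point initialization."""
    centers = [x[0]]
    for _ in range(k - 1):
        # Pick the point farthest from all centers chosen so far.
        d2 = ((x[:, None, :] - np.stack(centers)[None]) ** 2).sum(axis=-1).min(axis=1)
        centers.append(x[d2.argmax()])
    centers = np.stack(centers).astype(float)
    for _ in range(iters):
        labels = assign(x, centers)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = x[labels == j].mean(axis=0)
    return assign(x, centers), centers


def policy_input(obs, class_id, class_table):
    """(c) Concatenate an agent's partial observation with its class representation."""
    return np.concatenate([obs, class_table[class_id]])
```

In this sketch the cluster labels would serve as training targets for a per-agent class predictor, and `class_id` in `policy_input` would be that predictor's output at execution time; the predictor itself is omitted here.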

Paper Structure

This paper contains 37 sections, 14 equations, 31 figures, 12 tables, 2 algorithms.

Figures (31)

  • Figure 1: Illustration of the overall procedure for trajectory-class-aware policy learning: (a) through trajectory clustering, each trajectory is labeled. (b) Each agent predicts which trajectory class it is experiencing based on its partial observation. (c) After identifying the trajectory class, agents perform trajectory-class-dependent decision-making. In (c), each agent succeeds in identifying the same trajectory class based on its partial observations denoted with different colors.
  • Figure 2: State Diagram of multi-task setting
  • Figure 3: Overview of TRAMA framework. The purple dashed line represents a gradient flow.
  • Figure 4: PCA of sampled embedding $x \in {\mathcal{D}}$. Colors from red to purple (rainbow) represent early to late timestep in (a) and (b). (a) and (b) are the results of sample multi-tasks with three different unit combinations with various initial positions. (c) and (d) are the clustering results of (a) and (b), respectively, and each color (red, green, and blue) represents each class.
  • Figure 5: Preserved Labels
  • ...and 26 more figures

Theorems & Definitions (1)

  • Definition 2.1