Efficient Replay Memory Architectures in Multi-Agent Reinforcement Learning for Traffic Congestion Control

Mukul Chodhary; Kevin Octavian; SooJean Han

TL;DR

This work addresses congestion control in large-scale traffic networks using multi-agent reinforcement learning with memory-efficient exploration. It introduces Dual-Memory Integrated Learning (DMIL), a two-tier memory system with short-term and long-term memories, plus equivalence-class embeddings based on group-equivariance to bound memory growth while preserving learning performance. Theoretical analyses establish that the dual-memory size $msize_{Dual}[t]$ is bounded above by the SARSA replay size, and experiments on grid networks show that DMIL, especially with complex equivalence embeddings and entropy/diffusion rewards, improves congestion metrics and reduces memory growth relative to standard SARSA. The approach is scalable, modular, and demonstrates the value of heterogeneous memory and symmetry-based abstractions for efficient MARL in traffic control scenarios.

Abstract

Episodic control, inspired by the role of episodic memory in the human brain, has been shown to reduce the sample inefficiency of model-free reinforcement learning by reusing high-return past experiences. However, the memory growth of episodic control is undesirable in large-scale multi-agent problems such as vehicle traffic management. This paper proposes a novel replay memory architecture, called Dual-Memory Integrated Learning, to augment multi-agent reinforcement learning methods for congestion control via adaptive light signal scheduling. Our dual-memory architecture mimics two core capabilities of human decision-making. First, it relies on diverse types of memory (semantic and episodic, short-term and long-term) in order to remember high-return states that occur often in the network and filter out states that don't. Second, it employs equivalence classes to group together similar state-action pairs that can be controlled using the same action (i.e., light signal sequence). Theoretical analyses establish memory growth bounds, and simulation experiments on several intersection networks showcase improved congestion performance (e.g., vehicle throughput) from our method.
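The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the rotation-based `canonical` function, the `DualMemory` class, and the parameters `stm_capacity`, `promote_visits`, and `min_return` are all hypothetical names chosen for illustration. The sketch assumes a state is a tuple of queue lengths at a four-way intersection, treats rotations of a state as one equivalence class, and promotes an entry from short-term to long-term memory once it has been seen often enough with a high enough return.

```python
from collections import OrderedDict


def canonical(state):
    """Return a representative of the rotation class of `state`.

    Assumption: rotations of the queue-length tuple are treated as equivalent;
    the lexicographically smallest rotation is the class representative.
    """
    rotations = [state[i:] + state[:i] for i in range(len(state))]
    return min(rotations)


class DualMemory:
    """Hypothetical two-tier memory keyed by (equivalence class, action)."""

    def __init__(self, stm_capacity=4, promote_visits=2, min_return=0.0):
        self.stm = OrderedDict()  # key -> (visits, best_return), oldest first
        self.ltm = {}             # key -> best_return
        self.stm_capacity = stm_capacity
        self.promote_visits = promote_visits
        self.min_return = min_return

    def write(self, state, action, ret):
        key = (canonical(tuple(state)), action)
        if key in self.ltm:
            # Already long-term: only keep the best return seen so far.
            self.ltm[key] = max(self.ltm[key], ret)
            return
        visits, best = self.stm.pop(key, (0, float("-inf")))
        visits, best = visits + 1, max(best, ret)
        if visits >= self.promote_visits and best >= self.min_return:
            # Frequent and high-return: promote to long-term memory.
            self.ltm[key] = best
            return
        self.stm[key] = (visits, best)
        if len(self.stm) > self.stm_capacity:
            # Short-term memory is bounded: evict the oldest entry.
            self.stm.popitem(last=False)

    def size(self):
        return len(self.stm) + len(self.ltm)


memory = DualMemory(stm_capacity=2, promote_visits=2)
memory.write((3, 0, 1, 0), "NS_green", 1.0)
memory.write((1, 0, 3, 0), "NS_green", 2.0)  # a rotation of the first state
```

In this sketch the two writes fall into the same equivalence class, so they occupy a single long-term entry rather than two table rows. Keeping the action unchanged under rotation is a simplification; a fully equivariant treatment would transform the action together with the state.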

Paper Structure (17 sections, 3 theorems, 4 equations, 5 figures, 2 tables, 3 algorithms)

Introduction
Background
Reinforcement Learning for Network Control
Reinforcement Learning with Memory
Group Equivariance and Invariance
Problem Formulation
Dual-Memory Integrated Learning
Equivalence Classes
The Memory Architecture
Non-Memory vs. Memory-based Agents
Theoretical Analysis
Experiments
Different Architecture Designs
Training and Testing Pipeline
Evaluation Metrics
...and 2 more sections

Key Result

Lemma V.2

Consider the worst-case scenario where all the states encountered until time $t\triangleq nT_{stage}$ are unique (i.e., the number of equivalence classes equals the number of states). Assume $|\mathcal{A}| \geq 3\kappa$. Then, in the worst-case scenario where memory scales linearly, $msize_Q[t]$ ...

Figures (5)

Figure 1: Two equivalence class types implemented for traffic control.
Figure 2: Architectures for STM (top) and LTM (bottom).
Figure 3: Agent architectures for [top] SARSA (i.e., non-memory) and [bottom] memory-based.
Figure 4: Number of vehicles reaching destination ($V_c$) for the $5\times 5$ grid network. [Left] A comparison of the performance among different reward functions. [Right] A comparison among different memory architecture types, with entropy reward fixed.
Figure 5: Memory table growth vs. time for the $5\times 5$ network, averaged over all intersections. For SARSA (non-memory), the number of Q-table entries is shown instead.

Theorems & Definitions (7)

Remark III.1: Q-Value Function Design
Definition V.1
Lemma V.2: Worst-Case Trajectory
Lemma V.3: Best-Case Trajectory
Lemma V.4: Comparison Ratio of Dual-Memory and SARSA
Remark V.5: Optimal Hyperparameter Design
Remark VI.1
