Mitigating the Stability-Plasticity Dilemma in Adaptive Train Scheduling with Curriculum-Driven Continual DQN Expansion

Achref Jaziri; Etienne Künzel; Visvanathan Ramesh

TL;DR

The paper tackles the stability-plasticity dilemma in continual reinforcement learning for adaptive train scheduling by combining a curriculum-driven design with Continual DQN Expansion (CDE). CDE dynamically constructs and prunes Q-function subspaces, leveraging Elastic Weight Consolidation to preserve past tasks and Padé Activation Units to boost plasticity, enabling rapid adaptation to non-stationary environments. A structured curriculum decomposes the problem into interrelated skills (pathfinding, malfunction handling, deadlock avoidance) and a final evaluation, which, together with subspace expansion, yields superior learning efficiency and generalization in Flatland-based experiments. The work demonstrates a scalable approach for maintaining performance across evolving tasks in a complex multi-agent domain, with implications for robust, domain-specific continual learning in transportation optimization.

Abstract

A continual learning agent builds on previous experiences to develop increasingly complex behaviors by adapting to non-stationary and dynamic environments while preserving previously acquired knowledge. However, scaling these systems presents significant challenges, particularly in balancing the preservation of previous policies with the adaptation of new ones to current environments. This balance, known as the stability-plasticity dilemma, is especially pronounced in complex multi-agent domains such as the train scheduling problem, where environmental and agent behaviors are constantly changing and the search space is vast. In this work, we propose addressing these challenges in the train scheduling problem using curriculum learning. We design a curriculum with adjacent skills that build on each other to improve generalization performance. A curriculum with distinct tasks, however, introduces non-stationarity, which we address by proposing a new algorithm: Continual Deep Q-Network (DQN) Expansion (CDE). Our approach dynamically generates and adjusts Q-function subspaces to handle environmental changes and task requirements. CDE mitigates catastrophic forgetting through elastic weight consolidation (EWC) while ensuring high plasticity using adaptive rational activation functions. Experimental results demonstrate significant improvements in learning efficiency and adaptability compared to RL baselines and other adapted methods for continual learning, highlighting the potential of our method in managing the stability-plasticity dilemma in the adaptive train scheduling setting.
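As a rough illustration of the EWC regularization mentioned in the abstract, the sketch below computes the standard EWC quadratic penalty, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, and adds it to a task loss. It is a minimal sketch of the general technique, not the paper's implementation: the function names, the flat parameter lists, and the `lam` weighting are assumptions made for this example, and how CDE estimates the Fisher information is not specified in this summary.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameters (hypothetical flat list)
    anchor_params -- parameters saved after the previous task
    fisher_diag   -- diagonal Fisher information estimates, one per parameter
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    weighted_sq = sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )
    return 0.5 * lam * weighted_sq


def ewc_loss(
    task_loss: float,
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Loss on the new task plus the EWC penalty anchoring old-task weights."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
```

Parameters with a large Fisher value are pulled strongly back toward their previous-task values, while parameters with a small Fisher value remain free to move, which is how the penalty trades stability against plasticity.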

Paper Structure (19 sections, 4 equations, 8 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Train Scheduling Problem
Mathematical Formulation
Flatland Simulator
Curriculum Design for the Train Scheduling Problem
Continual DQN Expansion Algorithm
Formalisation of the Continual Reinforcement Learning Problem
Elastic Weight Consolidation
Padé Activation Units
Subspace Expansion of Q-Functions
Experiments
Training Curricula
Experimental Setting
Baselines and Hyperparameters
...and 4 more sections

Figures (8)

Figure 1: Diagram illustrating the Continual DQN Expansion Algorithm. When a new task or a change in the environment occurs, a new subspace with anchor $\theta$ is initialized. The new Q-function is trained with adaptive rational activation functions ($\theta^{HP}_i$) to enhance plasticity and achieve faster convergence on the new task. Simultaneously, the previous subspaces are further adapted on the new task by training Q-functions ($\theta^{CF}_i$) using elastic weight consolidation (EWC) regularization to mitigate catastrophic forgetting on previous tasks. Depending on the performance of different subspaces, the subspace set is either extended or pruned.
Figure 2: Illustration of the curriculum designed for the train scheduling problem (TSP) through distinct environments. (a) Pathfinding Environment: A grid where a single agent navigates from start to destination, emphasizing efficient route planning. (b) Malfunctions and Train Speeds Environment: A setting where agents deal with random malfunctions and varying train speeds, requiring adaptive responses and strategic planning to minimize delays. (c) Deadlocks Environment: Scenarios focusing on common deadlock situations, gradually increasing in complexity with more agents and fewer switches, to train agents in avoiding gridlock. (d) Full Task Evaluation Environment: An integrated setting combining pathfinding, malfunctions, train speeds, and deadlock scenarios to evaluate the agents. The curriculum environments (a-c) are available in multiple sizes to provide diverse training scenarios, as detailed in the appendix. This curriculum structure improves learning efficiency and performance on the full task evaluation environment (d).
Figure 3: Comparison of performance metrics for different CDE expansion strategies. The plots show the mean score (left) and completion rate (right) as a function of the number of expansions. The methods are abbreviated as follows: CDE-BN (subtask expansion with best network selection), CDE-MV (subtask expansion with majority vote), and CDE-STAN (standard task expansion).
Figure 4: Multiple graphs in a single figure. The first row contains the first two graphs, while the second row contains the last two graphs. Each graph represents different completion rates and scores obtained using various configurations of the DQN algorithm.
Figure 5: Task ordering impact analysis for DQN baseline across different curricula: PMD, MPD, and MDP (from left to right, top to bottom). Here, P stands for Pathfinding, M for Malfunctions with varying train speeds, and D for Deadlocks. The dotted purple line represents performance on the final test environment, while the orange, red, and pink lines correspond to the Pathfinding, Malfunction, and Deadlock environments, respectively. Vertical black lines indicate transitions to new tasks. The results highlight significant fluctuations in DQN's performance on the test environment depending on the task sequence. A noticeable performance drop is observed when DQN transitions to a new task, suggesting some degree of catastrophic forgetting. Furthermore, DQN appears to overfit the Pathfinding environment, which compromises its generalization capability.
...and 3 more figures
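
The adaptive rational activation functions named in the Figure 1 caption can be pictured with the short sketch below. It evaluates a rational function P(x) / (1 + |Q(x)|), the "safe" Padé activation form in which the absolute value keeps the denominator at or above 1. That this exact form and these coefficient conventions match the paper is an assumption, and the function name and argument layout are hypothetical; in a network the coefficients would be learned alongside the weights.

```python
from typing import Sequence


def pade_activation(
    x: float, numerator: Sequence[float], denominator: Sequence[float]
) -> float:
    """Evaluate P(x) / (1 + |Q(x)|) for a scalar input.

    numerator   -- coefficients a_0 .. a_m of P(x) = sum_j a_j * x**j
    denominator -- coefficients b_1 .. b_n of Q(x) = sum_k b_k * x**k
                   (no constant term; the leading 1 is fixed)
    """
    # Horner evaluation of the numerator polynomial.
    p = 0.0
    for a in reversed(numerator):
        p = p * x + a
    # Horner evaluation of b_1 + b_2*x + ..., then multiply by x
    # so that the lowest-order term is b_1 * x.
    q = 0.0
    for b in reversed(denominator):
        q = q * x + b
    q *= x
    # The absolute value keeps the denominator >= 1, so no poles occur.
    return p / (1.0 + abs(q))
```

With `numerator=[0.0, 1.0]` and an empty `denominator`, the function reduces to the identity, which shows how a fixed activation shape is just one setting of the learnable coefficients.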
