Highly Parallelized Reinforcement Learning Training with Relaxed Assignment Dependencies

Zhouyu He, Peng Qiao, Rongchun Li, Yong Dou, Yusong Tan

TL;DR

The paper addresses the high computational cost of training deep RL agents by proposing TianJi, a distributed system that relaxes inter-subtask assignment dependencies and uses event-driven asynchronous communication. It combines decentralized, data-driven training with a production–consumption balance strategy to manage sample staleness and preserve convergence. Empirical results show substantial improvements: up to 4.37× convergence-time speedups, 7.13× throughput gains, and near hardware-limited data transmission, with strong performance for both on-policy PPO and off-policy methods. The work demonstrates scalable acceleration across multiple nodes and provides a practical direction for improving DRL training efficiency in real systems, with code available at the project repository.

Abstract

As demand for more capable agents grows, training Deep Reinforcement Learning (DRL) models becomes increasingly complex, and accelerating DRL training has become a major research focus. Dividing the DRL training process into subtasks and running them in parallel can effectively reduce training costs. However, current DRL training systems are insufficiently parallel because of data assignment dependencies between subtask components. This assignment issue has been overlooked, yet addressing it can further boost training efficiency. We therefore propose TianJi, a high-throughput distributed RL training system. It relaxes assignment dependencies between subtask components and enables event-driven asynchronous communication, while maintaining clear boundaries between subtask components. To address the convergence uncertainty introduced by relaxed assignment dependencies, TianJi uses a distributed strategy based on balancing sample production and consumption. The strategy controls sample staleness to correct sample quality, ensuring convergence. In extensive experiments, TianJi achieves a convergence-time speedup of up to 4.37× over the compared systems. When scaled to eight computational nodes, TianJi shows a convergence-time speedup of 1.6× and a throughput speedup of 7.13× relative to XingTian, demonstrating both its ability to accelerate training and its scalability. In data transmission efficiency experiments, TianJi significantly outperforms other systems, approaching hardware limits. TianJi is also effective for on-policy algorithms, achieving convergence-time speedups of 4.36× and 2.95× over RLlib and XingTian, respectively. TianJi is available at https://github.com/HiPRL/TianJi.git.
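
To make the idea of staleness control concrete, the following is a minimal sketch, not TianJi's actual implementation. The abstract only states that staleness is controlled through the balance of sample production and consumption; the version-gap rule used here, along with the names `Sample`, `StalenessBoundedBuffer`, and `max_staleness`, are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_version: int  # version of the policy that produced this sample
    payload: object      # transition data (placeholder for this sketch)


class StalenessBoundedBuffer:
    """Hypothetical buffer that discards samples that are too stale.

    Staleness is assumed to be the gap between the learner's current
    policy version and the version that generated the sample.
    """

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self.queue = deque()
        self.produced = 0  # samples pushed by actors
        self.consumed = 0  # samples handed to the learner

    def put(self, sample: Sample) -> None:
        self.queue.append(sample)
        self.produced += 1

    def get_batch(self, batch_size: int, learner_version: int) -> list:
        batch = []
        while self.queue and len(batch) < batch_size:
            sample = self.queue.popleft()
            # Keep only samples within the allowed version gap; drop the rest.
            if learner_version - sample.policy_version <= self.max_staleness:
                batch.append(sample)
        self.consumed += len(batch)
        return batch

    def production_consumption_ratio(self) -> float:
        # A ratio well above 1 means actors outpace the learner.
        return self.produced / max(self.consumed, 1)


buf = StalenessBoundedBuffer(max_staleness=2)
for version in range(5):
    buf.put(Sample(policy_version=version, payload=None))

batch = buf.get_batch(batch_size=10, learner_version=4)
print([s.policy_version for s in batch])  # [2, 3, 4]
```

In this sketch, a ratio that drifts far from 1 would signal that production and consumption are out of balance, which is the kind of signal a scheduler could use to throttle actors or learners. How TianJi actually measures and corrects that balance is described in the paper itself.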

Paper Structure

This paper contains 29 sections, 3 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Spatiotemporal diagram of typical DRL systems during training. The annotations match those in Algorithm 1, with ➃ representing the computation of Prioritized Experience Replay (PER). W denotes a worker, A an actor, B a buffer, and L a learner. A2-L1 denotes a component comprising two actors and one learner. The diagram illustrates the abstraction, assignment dependencies, communication patterns, execution sequence, and resource utilization of these systems.
  • Figure 2: Data-driven training flowchart with relaxed assignment dependencies.
  • Figure 3: As the number of actors increases, the critical path shifts. The critical path is the sequence of tasks that determines the minimum time required to complete the computation.
  • Figure 4: TianJi outperforms the baselines on learning performance and computational efficiency.
  • Figure 5: Comparison of data transfer efficiency between TianJi and XingTian. ST is "sample time".
  • ...and 7 more figures