Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research

Tian Lan; Huan Wang; Caiming Xiong; Silvio Savarese

TL;DR

WarpSci presents a GPU-centered, domain-agnostic framework that eliminates CPU–GPU data transfers and enables thousands of concurrent RL simulations, addressing data-throughput bottlenecks in data-driven science. Built atop a unified in-place GPU data store and CUDA backend, WarpSci runs the entire RL workflow on GPUs and provides Python interfaces to streamline environment construction and interaction. Across classic control, multi-agent economics, and catalytic reaction problems, WarpSci achieves 10–100× throughput improvements and near-linear scaling, with faster and more stable convergence as data throughput increases. The framework demonstrates substantial practical impact for speeding up scientific RL studies, enabling domain-spanning experiments such as hydrogenation pathway exploration in catalysis and Haber–Bosch process optimization, using high-throughput, environment-agnostic simulations.

Abstract

We introduce WarpSci, a domain agnostic framework designed to overcome crucial system bottlenecks encountered in the application of reinforcement learning to intricate environments with vast datasets featuring high-dimensional observation or action spaces. Notably, our framework eliminates the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations on a single or multiple GPUs. This high data throughput architecture proves particularly advantageous for data-driven scientific research, where intricate environment models are commonly essential.

Paper Structure (9 sections, 4 figures)

This paper contains 9 sections, 4 figures.

Introduction
Contribution
Examples
Scalable Reinforcement Learning
Details of Architecture
Example Details
Classic Control.
Multi-Agent Economics.
Catalytic Reactions.

Figures (4)

Figure 1: A flow chart depicting WarpSci. Computations within this framework are organized into GPU blocks, each comprising multiple threads to facilitate concurrent environment roll-outs. Each thread is responsible for operating an agent that samples actions and computes rewards. These blocks have access to the global GPU memory, which houses the RL environment (depicted as a 3-D grid in a green-bordered box) with local variations, and deep policy models. Additionally, they store in-place roll-out data for training purposes. The dashed brown boxes represent references (not copies) of the policy model objects and data placeholders managed by blocks and hosted in the global memory. Users have the flexibility to compose and upload their custom environment setups to finalize the environment construction.
Figure 2: Scalability, convergence, and learning speed for WarpSci applied to Gym classic control environments. (a) Roll-out and training throughput in CartPole-v1 and Acrobot-v1 versus the number of parallel environments (log-log scale), up to 10K concurrent environments with random local initialization: the throughput scales linearly. Average episodic reward (the total reward accumulated from the start to the terminal state) versus training time (wall-clock minutes) for (b) CartPole-v1 and (c) Acrobot-v1 at various concurrency levels. The model was trained on a single Nvidia A100 GPU. For robustness, the results shown are averaged over eight independent runs from scratch with different initialization seeds and the same hyperparameters. The shaded regions represent the error bars of the eight independent runs.
Figure 3: WarpSci performance in the COVID-19 economic simulation, shown on a log scale. Left: Note that there is no data transfer with WarpSci. With 60 parallel environments, WarpSci achieves 24 times higher throughput than CPU-based distributed training architectures (“total”). Moreover, both the roll-out and training phases are an order of magnitude faster than on the distributed N1 node. Right: Environment steps per second and end-to-end training speed scale almost linearly with the number of environments. (Credit: lan2021warpdrive)
Figure 4: Convergence and learning speed, measured against total runtime in wall-clock minutes, for Langmuir-Hinshelwood (a, b) and Eley-Rideal (c, d) hydrogenation reactions of NH$_2$ to NH$_3$. The number of concurrent environment instances was varied: 4 in red, 20 in green, 100 in blue, and 500 in yellow. The episodic reward denotes the mean accumulated reward that H atom actors gather from the initial to the terminal state of (a) Langmuir-Hinshelwood and (c) Eley-Rideal. The episodic step indicates the average total number of steps to reach the terminal state for (b) Langmuir-Hinshelwood and (d) Eley-Rideal. Training used a single Nvidia A100 GPU. For robustness, the displayed results are averages over five independent runs from scratch with different initialization seeds and identical hyperparameters. Shaded regions depict the error bars of the five independent runs. (Credit: lan2024)
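
The layout described in the Figure 1 caption (one block per environment replica, one thread per agent, roll-out data written in place to shared global memory) can be illustrated with a small CPU-only sketch. This is a toy emulation in plain Python, not WarpSci code: the names `RolloutStore`, `step_agent`, and `launch` are hypothetical, the nested loops stand in for what a GPU would run concurrently, and the random action and reward are placeholders for a policy model and an environment.

```python
# Toy, CPU-only illustration of the block/thread layout in the Figure 1 caption.
# All names (RolloutStore, step_agent, launch) are hypothetical and are NOT
# part of WarpSci's API; the policy and reward below are placeholders.
import random


class RolloutStore:
    """Preallocated flat buffers, standing in for global GPU memory."""

    def __init__(self, num_envs, num_agents, num_steps):
        self.num_envs = num_envs
        self.num_agents = num_agents
        self.num_steps = num_steps
        size = num_envs * num_agents * num_steps
        self.actions = [0] * size
        self.rewards = [0.0] * size

    def index(self, t, block_id, thread_id):
        # Row-major layout over (step, environment, agent).
        return (t * self.num_envs + block_id) * self.num_agents + thread_id


def step_agent(store, t, block_id, thread_id, rng):
    """Work of one 'thread': sample an action, compute a reward, write in place."""
    action = rng.randrange(2)               # placeholder for policy sampling
    reward = 1.0 if action == 1 else 0.0    # placeholder reward function
    i = store.index(t, block_id, thread_id)
    store.actions[i] = action               # no copy: the slot is overwritten
    store.rewards[i] = reward


def launch(store, seed=0):
    """Sequential emulation of one block per environment, one thread per agent."""
    rng = random.Random(seed)
    for t in range(store.num_steps):
        for block_id in range(store.num_envs):
            for thread_id in range(store.num_agents):
                step_agent(store, t, block_id, thread_id, rng)
    return store


store = launch(RolloutStore(num_envs=4, num_agents=3, num_steps=5))
```

Because every (step, environment, agent) triple maps to its own slot in a buffer allocated once up front, a training step could read the same buffers directly; in this sketch that is the property standing in for the absence of CPU–GPU transfers.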
