Automatic Generation of High-Performance RL Environments

Seth Karten; Rahul Dev Appapogu; Chi Jin

Abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe (a generic prompt template, hierarchical verification, and iterative agent-assisted repair) that produces semantically equivalent high-performance environments for <$10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M steps per second (SPS) with random actions, 15.2M SPS under PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: HalfCheetah JAX (throughput parity with MJX at 1.04x, and 5x over Brax at matched GPU batch sizes) and Puffer Pong (42x PPO). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS with random actions, 153K SPS under PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. For all five environments, hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence, and cross-backend policy transfer confirms zero sim-to-sim gap. TCGJax, synthesized from a private reference absent from public repositories, serves as a control for concerns about contamination of the agent's pretraining data. The paper contains sufficient detail (including representative prompts, verification methodology, and complete results) that a coding agent could reproduce the translations directly from the manuscript.
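
The verification idea in the abstract (property, interaction, and rollout tests, run in order of increasing scope, with failures feeding a repair step) can be illustrated on a toy problem. The sketch below is not the paper's harness: the 1-D environment, the function names (`reference_step`, `translated_step`, `buggy_step`, `first_failure`), and the exact content of each test level are assumptions made only for illustration.

```python
import random
from typing import Callable, Optional, Tuple

State = Tuple[int, int]  # (position, velocity) of a hypothetical 1-D walker
StepFn = Callable[[State, int], Tuple[State, float]]
ACTIONS = (-1, 0, 1)


def reference_step(state: State, action: int) -> Tuple[State, float]:
    """Hypothetical reference environment (stands in for the slow original)."""
    pos, vel = state
    vel = max(-2, min(2, vel + action))
    pos = max(0, min(10, pos + vel))
    return (pos, vel), float(pos == 10)


def translated_step(state: State, action: int) -> Tuple[State, float]:
    """Hypothetical translation: different code, intended to be equivalent."""
    pos, vel = state
    vel = sorted((-2, vel + action, 2))[1]  # median of three == clamp
    pos = sorted((0, pos + vel, 10))[1]
    return (pos, vel), 1.0 if pos == 10 else 0.0


def buggy_step(state: State, action: int) -> Tuple[State, float]:
    """Hypothetical faulty translation: velocity clamp is too wide."""
    pos, vel = state
    vel = max(-3, min(3, vel + action))
    pos = max(0, min(10, pos + vel))
    return (pos, vel), float(pos == 10)


def all_states():
    return [(p, v) for p in range(11) for v in range(-2, 3)]


def property_test(step: StepFn) -> bool:
    # Assumed level 1: invariants of single transitions, no reference needed.
    for s in all_states():
        for a in ACTIONS:
            (p, v), r = step(s, a)
            if not (0 <= p <= 10 and -2 <= v <= 2 and r in (0.0, 1.0)):
                return False
    return True


def interaction_test(ref: StepFn, new: StepFn) -> bool:
    # Assumed level 2: identical one-step outputs for every (state, action).
    return all(ref(s, a) == new(s, a) for s in all_states() for a in ACTIONS)


def rollout_test(ref: StepFn, new: StepFn, seeds=range(5), horizon=50) -> bool:
    # Assumed level 3: identical trajectories under shared action sequences.
    for seed in seeds:
        rng = random.Random(seed)
        s_ref = s_new = (0, 0)
        for _ in range(horizon):
            a = rng.choice(ACTIONS)
            s_ref, r_ref = ref(s_ref, a)
            s_new, r_new = new(s_new, a)
            if (s_ref, r_ref) != (s_new, r_new):
                return False
    return True


def first_failure(ref: StepFn, new: StepFn) -> Optional[str]:
    """Run levels in order; return the first failing level, or None if all pass."""
    levels = [
        ("property", lambda: property_test(new)),
        ("interaction", lambda: interaction_test(ref, new)),
        ("rollout", lambda: rollout_test(ref, new)),
    ]
    for name, check in levels:
        if not check():
            return name
    return None
```

In a repair loop, the name returned by `first_failure` would tell the agent which narrow level to fix before the broader ones are re-run; cheap, local checks fail first, which keeps each repair targeted.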

Paper Structure (85 sections, 9 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Hardware-accelerated environments.
High-throughput RL systems.
LLM-assisted code generation.
Scaling RL.
Translation Recipe
Problem Statement
Hierarchical Verification
Agent-Assisted Translation Process
Experiments
Throughput Results
Training Time Breakdown
Policy Equivalence
Cross-backend policy transfer (L4).
...and 70 more sections

Figures (9)

Figure 1: Performance environments eliminate the environment bottleneck. (Top) Our methodology shifts training from environment-bound to model-bound. (Bottom) Five case studies, grouped by result type. Direct translation into newly performant environments (no prior performance implementation): EmuRust ($1.5\times$ CPU-to-CPU PPO); PokeJAX, the first GPU-parallel Pokemon battle simulator, 500M SPS at 65K batch. Translation verified against existing performance implementations: throughput parity with MJX ($1.04\times$) and $5\times$ over Brax at matched batch (HalfCheetah); $42\times$ end-to-end PPO over expert-optimized C (Pong). New environment creation: TCGJax, the first deployable JAX Pokemon card-game engine, 717K SPS, created from a web-extracted specification. All produced for <\$10 in agent compute.
Figure 2: Translation and verification pipeline. A reference environment is decomposed into modules, translated by a coding agent, and verified through four levels of increasing scope. Failures at any level trigger targeted repair and re-verification; Level 4 cross-backend policy transfer closes the outer loop.
Figure 3: PPO training time breakdown across model scales. Three bars per implementation show 2M, 20M, 200M parameter models. Performance implementations drop to ${\leq}4\%$ env overhead at 200M. All on 1$\times$ RTX 5090.
Figure 4: Policy equivalence. Pong (10 seeds), HalfCheetah (10 seeds), EmuRust (10 seeds): matched reward curves across backends. TCGJax and PokeJAX: matched Elo curves (JAX vs reference). All five environments achieve L4 cross-backend transfer (Table \ref{tab:cross_transfer}).
Figure 5: Throughput scaling. EmuRust (left) saturates at 128 CPU envs. PokeJAX (center) scales linearly with GPU batch size. TCG Pocket (right): Python peaks at 16 processes; JAX scales with batch size.
...and 4 more figures
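
The environment-overhead percentages quoted for Figure 3 can be read as a simple time fraction. The sketch below assumes overhead is defined as environment time divided by total time per training iteration; that definition, the helper names (`env_seconds_per_iter`, `env_overhead`), and all numbers in the usage note are illustrative assumptions, not measurements from the paper.

```python
def env_seconds_per_iter(steps_per_iter: int, env_sps: float) -> float:
    """Seconds spent stepping the environment for one training iteration."""
    if env_sps <= 0:
        raise ValueError("env_sps must be positive")
    return steps_per_iter / env_sps


def env_overhead(env_seconds: float, model_seconds: float) -> float:
    """Assumed definition: fraction of iteration time spent in the environment."""
    total = env_seconds + model_seconds
    if total <= 0:
        raise ValueError("total iteration time must be positive")
    return env_seconds / total


if __name__ == "__main__":
    # Hypothetical numbers only: a fixed environment cost against a model
    # cost (forward and backward passes) that grows with parameter count.
    env_s = env_seconds_per_iter(steps_per_iter=1_000_000, env_sps=1_000_000.0)
    for model_s in (0.5, 5.0, 50.0):
        print(f"model={model_s:5.1f}s  overhead={env_overhead(env_s, model_s):.1%}")
```

Under this reading, a fast environment keeps the numerator small, so the overhead fraction shrinks as the model grows; training shifts from environment-bound to model-bound.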
