RLBenchNet: The Right Network for the Right Reinforcement Learning Task

Ivan Smirnov, Shangding Gu

TL;DR

This paper benchmarks seven neural architectures (MLP, LSTM, GRU, Transformer-XL, Gated Transformer-XL (GTrXL), Mamba, Mamba-2) within PPO across MuJoCo, Atari, MiniGrid, and classic control tasks to reveal architecture-specific strengths and trade-offs. MLPs excel in fully observable continuous control, while Mamba-2 balances fast training and memory capacity on sequence-rich tasks. Among the architectures tested, only the Transformer-based models and Mamba-2 solve the most memory-intensive tasks, and the Transformers do so at higher compute and memory cost. The results yield concrete guidelines for selecting an architecture from a task's memory requirements and the available compute: simpler, efficient models often outperform more complex ones when the task dynamics do not demand extensive memory. Code is available at the linked repository.

Abstract

Reinforcement learning (RL) has seen significant advancements through the application of various neural network architectures. In this study, we systematically investigate the performance of several neural networks in RL tasks, including Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Transformer-XL, Gated Transformer-XL, and Mamba/Mamba-2. Through comprehensive evaluation across continuous control, discrete decision-making, and memory-based environments, we identify architecture-specific strengths and limitations. Our results reveal that: (1) MLPs excel in fully observable continuous control tasks, providing an optimal balance of performance and efficiency; (2) recurrent architectures like LSTM and GRU offer robust performance in partially observable environments with moderate memory requirements; (3) Mamba models achieve 4.5x higher throughput than LSTM and 3.9x higher throughput than GRU while maintaining comparable performance; and (4) only Transformer-XL, Gated Transformer-XL, and Mamba-2 successfully solve the most challenging memory-intensive tasks, with Mamba-2 requiring 8x less memory than Transformer-XL. These findings provide insights for researchers and practitioners, enabling more informed architecture selection based on specific task characteristics and computational constraints. Code is available at: https://github.com/SafeRL-Lab/RLBenchNet
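
The four findings above can be read as a rough selection rule. The sketch below is one possible encoding of them in Python, not the authors' procedure: the `Memory` levels, the `Task` fields, and the function name `suggest_architectures` are all hypothetical names introduced for illustration, and the ordering of candidates only reflects the throughput and memory statements in the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Memory(Enum):
    """Hypothetical three-level summary of a task's memory demand."""
    NONE = "none"          # fully observable
    MODERATE = "moderate"  # partially observable, moderate memory
    LONG = "long"          # most challenging memory-intensive tasks


@dataclass(frozen=True)
class Task:
    memory: Memory
    memory_constrained: bool = False   # assumed flag: device memory is tight
    throughput_critical: bool = False  # assumed flag: training speed matters


def suggest_architectures(task: Task) -> list[str]:
    """Return candidate architectures, most preferred first."""
    if task.memory is Memory.NONE:
        # Finding (1): MLPs excel when the task is fully observable.
        return ["MLP"]
    if task.memory is Memory.MODERATE:
        # Finding (2): LSTM and GRU are robust here.
        # Finding (3): Mamba is comparable and has higher throughput.
        if task.throughput_critical:
            return ["Mamba", "LSTM", "GRU"]
        return ["LSTM", "GRU", "Mamba"]
    # Finding (4): only these three solve the hardest memory tasks.
    # The abstract does not rank them, except that Mamba-2 uses less memory.
    if task.memory_constrained:
        return ["Mamba-2", "Gated Transformer-XL", "Transformer-XL"]
    return ["Transformer-XL", "Gated Transformer-XL", "Mamba-2"]
```

For example, `suggest_architectures(Task(Memory.LONG, memory_constrained=True))` puts Mamba-2 first, following the abstract's statement that it needs 8x less memory than Transformer-XL.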

Paper Structure

This paper contains 20 sections, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: Average returns for MuJoCo tasks. MLP and LSTM demonstrate competitive or superior performance in Walker2d and HalfCheetah, while GRU and Transformer-XL perform best in Hopper.
  • Figure 2: Average returns across random seeds for Atari environments. Mamba and MLP with frame stacking excel in Pong, while LSTM and MLP with frame stacking perform best in Breakout.
  • Figure 3: Average returns across seeds for masked classic control tasks. Recurrent architectures and stacked MLPs excel in CartPole, while Transformer-XL performs best in LunarLander.
  • Figure 4: Average returns across random seeds for MiniGrid environments. In DoorKey-8x8, the original Mamba shows the fastest convergence, while in Memory-S11, only Transformer-XL and Mamba-2 achieve meaningful learning, with Mamba-2 reaching near-optimal performance.
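
The captions for Figures 2 and 3 mention MLPs with frame stacking and masked classic control tasks. The sketch below illustrates both ideas in plain Python. It is a sketch under assumptions, not the paper's implementation: the masking scheme (zeroing selected observation entries), the choice of masked indices, and the names `mask_observation` and `FrameStack` are all introduced here for illustration.

```python
from collections import deque


def mask_observation(obs, hidden_indices):
    """Zero out the entries at hidden_indices (assumed masking scheme)."""
    hidden = set(hidden_indices)
    return [0.0 if i in hidden else x for i, x in enumerate(obs)]


class FrameStack:
    """Keeps the last k observations and returns them concatenated."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Fill the buffer with copies of the first observation.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(list(obs))
        return self.stacked()

    def step(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        # Oldest frame first, newest last, flattened into one input vector.
        return [x for frame in self.frames for x in frame]
```

With a CartPole-style observation of position, velocity, angle, and angular velocity, masking indices 1 and 3 (an assumed choice) hides the velocities. Stacking two consecutive masked frames then gives a memoryless MLP enough information to estimate them from the change in position and angle.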