Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning

Egor Cherepanov, Nikita Kachaev, Alexey K. Kovalev, Aleksandr I. Panov

TL;DR

This work addresses the lack of a universal memory benchmark for reinforcement learning by introducing MIKASA, a two-part benchmark suite consisting of MIKASA-Base (a unified, Gymnasium-based collection of memory tasks) and MIKASA-Robo (32 memory-intensive robotic manipulation tasks). It formalizes a four-way memory taxonomy, provides memory-focused datasets for offline RL, and evaluates online, offline, and VLA baselines to reveal current limitations in memory-enabled agents. The results show that even memory-augmented architectures struggle as memory demands increase, underscoring the need for specialized memory mechanisms in realistic robotic tasks. By offering installable tooling, standardized evaluation, and rich datasets, MIKASA aims to accelerate the development of robust memory-aware RL systems for real-world applications.

Abstract

Memory is crucial for enabling agents to tackle complex tasks with temporal and spatial dependencies. While many reinforcement learning (RL) algorithms incorporate memory, the field lacks a universal benchmark to assess an agent's memory capabilities across diverse scenarios. This gap is particularly evident in tabletop robotic manipulation, where memory is essential for solving tasks with partial observability and ensuring robust performance, yet no standardized benchmarks exist. To address this, we introduce MIKASA (Memory-Intensive Skills Assessment Suite for Agents), a comprehensive benchmark for memory RL, with three key contributions: (1) we propose a comprehensive classification framework for memory-intensive RL tasks, (2) we collect MIKASA-Base -- a unified benchmark that enables systematic evaluation of memory-enhanced agents across diverse scenarios, and (3) we develop MIKASA-Robo (pip install mikasa-robo-suite) -- a novel benchmark of 32 carefully designed memory-intensive tasks that assess memory capabilities in tabletop robotic manipulation. Our work introduces a unified framework to advance memory RL research, enabling more robust systems for real-world use. MIKASA is available at https://tinyurl.com/membenchrobots.
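To make the notion of a memory-intensive task concrete, the following is a minimal sketch of a toy environment in which a cue is visible only at the first step and must be recalled after a delay. It is loosely inspired by the description of RememberColor-style tasks, but it is not part of MIKASA and does not use its API: the class `ToyRememberColorEnv`, the two policies, and all parameters (`num_colors`, `delay`, `seed`) are hypothetical names introduced purely for illustration. The `reset`/`step` signatures merely mimic the Gymnasium convention using the Python standard library.

```python
import random


class ToyRememberColorEnv:
    """Hypothetical toy task (not a MIKASA environment).

    The cue (a color index) is observable only at reset. It is then hidden
    for `delay` steps, and only the action taken at the final step is scored.
    """

    HIDDEN = -1  # observation returned while the cue is not visible

    def __init__(self, num_colors=3, delay=5, seed=None):
        self.num_colors = num_colors
        self.delay = delay
        self.rng = random.Random(seed)
        self.target = None
        self.t = 0

    def reset(self):
        self.target = self.rng.randrange(self.num_colors)
        self.t = 0
        return self.target, {}  # the cue is shown exactly once

    def step(self, action):
        self.t += 1
        terminated = self.t == self.delay + 1
        # Only the final action is rewarded; earlier actions are ignored.
        reward = 1.0 if terminated and action == self.target else 0.0
        return self.HIDDEN, reward, terminated, False, {}


class MemoryPolicy:
    """Stores the cue when it is visible and replays it afterwards."""

    def reset(self):
        self.cue = None

    def act(self, obs):
        if obs != ToyRememberColorEnv.HIDDEN:
            self.cue = obs
        return self.cue


class ReactivePolicy:
    """Memoryless policy: acts on the current observation only."""

    def reset(self):
        pass

    def act(self, obs):
        # With the cue hidden, the best it can do is a fixed guess (color 0).
        return obs if obs != ToyRememberColorEnv.HIDDEN else 0


def run_episode(env, policy):
    """Roll out one episode and return the total reward."""
    obs, _ = env.reset()
    policy.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy.act(obs))
        total += reward
        done = terminated or truncated
    return total


if __name__ == "__main__":
    env = ToyRememberColorEnv(num_colors=3, delay=5, seed=0)
    for policy in (MemoryPolicy(), ReactivePolicy()):
        mean = sum(run_episode(env, policy) for _ in range(1000)) / 1000
        print(type(policy).__name__, mean)
```

In this sketch the memory policy always succeeds, whereas the reactive policy succeeds only when the target happens to match its fixed guess, which illustrates why partial observability over time requires some form of memory. Increasing `delay` or `num_colors` is one simple way such a toy task could be made harder for learned agents.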

Paper Structure

This paper contains 114 sections, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Systematic classification of problems with memory in RL reveals distinct memory utilization patterns and enables objective evaluation of memory mechanisms across different agents.
  • Figure 2: Illustration of demonstrative memory-intensive tasks execution from the proposed MIKASA-Robo benchmark. The ShellGameTouch-v0 task requires the agent to memorize the ball's location under mugs and touch the correct one. In RememberColor9-v0, the agent must memorize a cube's color and later select the matching one. In RotateLenientPos-v0, the agent must rotate a peg while keeping track of its previous rotations.
  • Figure 3: MIKASA bridges the gap between human-like memory complexity and RL agents' requirements. While agent tasks don't require the full spectrum of human memory capabilities, they can't be reduced to simple spatio-temporal dependencies. MIKASA provides a balanced framework that captures essential memory aspects for agent tasks while maintaining practical simplicity.
  • Figure 4: Performance of PPO-MLP trained in state mode, i.e., in MDP mode without the need for memory. These results suggest that the proposed tasks are inherently solvable with a success rate of 100%.
  • Figure 5: Online RL baselines with MLP and LSTM backbones trained in RGB+joints mode on the RememberColor-v0 environment with dense rewards. Both architectures fail to solve medium and high complexity tasks.
  • ...and 17 more figures