STORI: A Benchmark and Taxonomy for Stochastic Environments

Aryan Amit Barsainyan, Jing Yu Lim, Dianbo Liu

TL;DR

STORI introduces a five-type stochasticity taxonomy and a configurable Atari-based benchmark to stress-test reinforcement learning methods under diverse uncertainties. By evaluating DreamerV3 and STORM, the paper reveals systematic vulnerabilities of world models, including underestimation of environmental variance, sensitivity to action corruption, and unstable dynamics under partial observability and concept drift. The framework enables targeted diagnostic probes, reproducible experimentation, and a pathway toward stochasticity-aware world models and policies. Overall, STORI provides a practical, open-source platform for advancing robust RL in uncertain, real-world settings.

Abstract

Reinforcement learning (RL) techniques have achieved impressive performance on simulated benchmarks such as Atari100k, yet recent advances remain largely confined to simulation and show limited transfer to real-world domains. A central obstacle is environmental stochasticity, as real systems involve noisy observations, unpredictable dynamics, and non-stationary conditions that undermine the stability of current methods. Existing benchmarks rarely capture these uncertainties and favor simplified settings where algorithms can be tuned to succeed. The absence of a well-defined taxonomy of stochasticity further complicates evaluation, as robustness to one type of stochastic perturbation, such as sticky actions, does not guarantee robustness to other forms of uncertainty. To address this critical gap, we introduce STORI (STOchastic-ataRI), a benchmark that systematically incorporates diverse stochastic effects and enables rigorous evaluation of RL techniques under different forms of uncertainty. We propose a comprehensive five-type taxonomy of environmental stochasticity and demonstrate systematic vulnerabilities in state-of-the-art model-based RL algorithms through targeted evaluation of DreamerV3 and STORM. Our findings reveal that world models dramatically underestimate environmental variance, struggle with action corruption, and exhibit unreliable dynamics under partial observability. We release the code and benchmark publicly at https://github.com/ARY2260/stori, providing a unified framework for developing more robust RL systems.
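To make the notion of action corruption concrete, the following is a minimal sketch and not the STORI implementation. The class name `CorruptedActions`, the parameter `p`, and the toy `echo_step` function are all hypothetical and chosen for illustration. It assumes one simple corruption model, in which the agent's chosen action is replaced by a uniformly random action with probability `p`. Sticky actions, mentioned above, are a different perturbation (the previous action is repeated) and are not what this sketch implements.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class CorruptedActions:
    """Hypothetical wrapper (not part of STORI): with probability p, the
    agent's action is replaced by a uniformly random action before it is
    passed to the underlying step function."""

    def __init__(self, step_fn: Callable[[int], T], num_actions: int,
                 p: float, seed: int = 0) -> None:
        if not 0.0 <= p <= 1.0:
            raise ValueError("p must lie in [0, 1]")
        self._step_fn = step_fn
        self._num_actions = num_actions
        self._p = p
        self._rng = random.Random(seed)  # seeded for reproducible runs

    def step(self, action: int) -> T:
        # Corrupt the action with probability p, then step the environment.
        if self._rng.random() < self._p:
            action = self._rng.randrange(self._num_actions)
        return self._step_fn(action)


def echo_step(action: int) -> int:
    """Toy stand-in for an environment step: returns the executed action."""
    return action


def replaced_fraction(p: float, trials: int = 10_000) -> float:
    """Fraction of steps on which the executed action differs from action 0."""
    env = CorruptedActions(echo_step, num_actions=4, p=p)
    changed = sum(env.step(0) != 0 for _ in range(trials))
    return changed / trials
```

Because the random replacement can coincide with the intended action, the observed fraction of changed actions under this model is roughly `p * (num_actions - 1) / num_actions` rather than `p` itself. With four actions and `p = 0.4`, about 30% of executed actions differ from the agent's choice.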

Paper Structure

This paper contains 55 sections, 6 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: STORI benchmark and results. (a) Framework for systematic stochasticity evaluation. (b) DreamerV3 and STORM performance degradation under uncertainty (Types: 1=action corruption, 2.1=random events, 2.2=concept drift, 3.1=default, 3.2=missing information).
  • Figure 2: Stochasticity types in STORI benchmark.
  • Figure 3: Partial observability probe showing prediction errors when models start with clear vs. obscured observations.
  • Figure 4: STORI implementation overview.
  • Figure 5: RAM state of Atari Breakout (left) and the corresponding observation image from the emulator (right), annotated with state variables such as ball position and block states.
  • ...and 5 more figures