STORI: A Benchmark and Taxonomy for Stochastic Environments

Aryan Amit Barsainyan; Jing Yu Lim; Dianbo Liu

TL;DR

STORI introduces a five-type taxonomy of stochasticity and a configurable Atari-based benchmark for stress-testing reinforcement learning methods under diverse forms of uncertainty. By evaluating DreamerV3 and STORM, the paper reveals systematic vulnerabilities of world models, including underestimation of environmental variance, sensitivity to action corruption, and unstable dynamics under partial observability and concept drift. The framework enables targeted diagnostic probes, reproducible experimentation, and a pathway toward stochasticity-aware world models and policies. Overall, STORI provides a practical, open-source platform for advancing robust RL in uncertain, real-world settings.

Abstract

Reinforcement learning (RL) techniques have achieved impressive performance on simulated benchmarks such as Atari100k, yet recent advances remain largely confined to simulation and show limited transfer to real-world domains. A central obstacle is environmental stochasticity, as real systems involve noisy observations, unpredictable dynamics, and non-stationary conditions that undermine the stability of current methods. Existing benchmarks rarely capture these uncertainties and favor simplified settings where algorithms can be tuned to succeed. The absence of a well-defined taxonomy of stochasticity further complicates evaluation, as robustness to one type of stochastic perturbation, such as sticky actions, does not guarantee robustness to other forms of uncertainty. To address this critical gap, we introduce STORI (STOchastic-ataRI), a benchmark that systematically incorporates diverse stochastic effects and enables rigorous evaluation of RL techniques under different forms of uncertainty. We propose a comprehensive five-type taxonomy of environmental stochasticity and demonstrate systematic vulnerabilities in state-of-the-art model-based RL algorithms through targeted evaluation of DreamerV3 and STORM. Our findings reveal that world models dramatically underestimate environmental variance, struggle with action corruption, and exhibit unreliable dynamics under partial observability. We release the code and benchmark publicly at https://github.com/ARY2260/stori, providing a unified framework for developing more robust RL systems.
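
To make the "sticky actions" perturbation mentioned in the abstract concrete, the following is a minimal sketch of the general idea: with some probability, the environment ignores the agent's new action and repeats the previously executed one. This is an illustration only and is not STORI's implementation or API. The names `StickyActionEnv`, `EchoEnv`, and `repeat_prob` are hypothetical, and the default of 0.25 is assumed here because it is a value commonly used with the Arcade Learning Environment.

```python
import random


class StickyActionEnv:
    """Hypothetical wrapper (not part of STORI) that adds sticky actions.

    Wraps any object exposing reset() and step(action). On each step, with
    probability `repeat_prob`, the previously executed action is repeated
    instead of the action the agent requested.
    """

    def __init__(self, env, repeat_prob=0.25, seed=None):
        if not 0.0 <= repeat_prob <= 1.0:
            raise ValueError("repeat_prob must lie in [0, 1]")
        self.env = env
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)  # private RNG for reproducibility
        self.prev_action = None

    def reset(self):
        # Forget the previous action at the start of every episode.
        self.prev_action = None
        return self.env.reset()

    def step(self, action):
        # Possibly override the requested action with the last executed one.
        if self.prev_action is not None and self.rng.random() < self.repeat_prob:
            action = self.prev_action
        self.prev_action = action
        return self.env.step(action)


class EchoEnv:
    """Toy stand-in environment whose step() returns the executed action."""

    def reset(self):
        return 0

    def step(self, action):
        return action
```

Wrapping the toy `EchoEnv` with `repeat_prob=0.0` leaves the agent's actions unchanged, while `repeat_prob=1.0` locks the environment onto the first action of the episode. Values in between give the partial action corruption that this kind of perturbation is meant to model.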
