JEDI: Latent End-to-end Diffusion Mitigates Agent-Human Performance Asymmetry in Model-Based Reinforcement Learning
Jing Yu Lim, Zarif Ikram, Samson Yu, Haozhe Ma, Tze-Yun Leong, Dianbo Liu
TL;DR
This work tackles the misalignment between human and agent performance in Atari100k by revealing a pronounced task-wise asymmetry in pixel-based MBRL and proposing JEDI, a latent end-to-end diffusion world model trained with a self-consistency objective inspired by JEPA. By integrating a latent diffusion dynamics model with an end-to-end encoder, JEDI achieves state-of-the-art results on human-optimal tasks while staying competitive overall, and it does so with substantial efficiency gains (faster inference, faster training, and lower memory) thanks to latent compression. The key contribution is demonstrating that temporally structured latent representations learned end-to-end via diffusion can bridge the gap between agent and human performance, challenging prior reliance on pixel-space diffusion and detached encoders. Overall, the work advances holistic human-level performance assessment in Atari100k and offers a scalable, efficient framework for temporally aware, model-based RL.
Abstract
Recent advances in model-based reinforcement learning (MBRL) have achieved super-human performance on the Atari100k benchmark, driven by reinforcement learning agents trained on powerful diffusion world models. However, we find that the current aggregate metrics mask a major performance asymmetry: MBRL agents dramatically outperform humans in some tasks while drastically underperforming in others, and the former inflate the aggregates. This asymmetry is especially pronounced in pixel-based agents trained with diffusion world models. In this work, we target the asymmetry in pixel-based agents as a first step toward reversing its worrying growth in these agents. We address the problematic aggregates by delineating all tasks as Agent-Optimal or Human-Optimal and advocate placing equal importance on metrics from both sets. Next, we hypothesize that the pronounced asymmetry arises because pixel-based methods lack a temporally structured latent space trained with the world model objective. Lastly, to address this issue, we propose Joint Embedding DIffusion (JEDI), a novel latent diffusion world model trained end-to-end with the self-consistency objective. JEDI outperforms SOTA models on Human-Optimal tasks while staying competitive across the Atari100k benchmark, and it runs 3 times faster and uses 43% less memory than the latest pixel-based diffusion baseline. Overall, our work rethinks what it truly means to cross human-level performance in Atari100k.
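To make the aggregate-masking argument concrete, the sketch below computes the standard human-normalized score, (agent - random) / (human - random), and then reports metrics separately for two task sets. It is only an illustration under stated assumptions: the split rule (a task counts as Agent-Optimal when its human-normalized score is at least 1.0) is a hypothetical criterion, since the abstract does not state how tasks are delineated, and the task names and scores are made up rather than taken from any result.

```python
from statistics import mean, median


def human_normalized_score(agent, random, human):
    """Standard human-normalized score: 0.0 = random play, 1.0 = human level."""
    return (agent - random) / (human - random)


def split_by_optimality(hns_by_task, threshold=1.0):
    """Partition tasks into (agent_optimal, human_optimal).

    ASSUMPTION: the threshold rule is hypothetical and chosen for illustration;
    it is not confirmed to be the delineation used by the authors.
    """
    agent_optimal = {t: s for t, s in hns_by_task.items() if s >= threshold}
    human_optimal = {t: s for t, s in hns_by_task.items() if s < threshold}
    return agent_optimal, human_optimal


def summarize(hns_by_task):
    """Mean and median human-normalized score over a set of tasks."""
    scores = list(hns_by_task.values())
    return {"mean": mean(scores), "median": median(scores), "n": len(scores)}


# Made-up raw scores per task: (agent, random, human). Not real benchmark data.
raw_scores = {
    "task_a": (900.0, 100.0, 500.0),
    "task_b": (150.0, 100.0, 600.0),
    "task_c": (40.0, 0.0, 80.0),
    "task_d": (700.0, 100.0, 300.0),
}

hns = {task: human_normalized_score(*s) for task, s in raw_scores.items()}
agent_optimal, human_optimal = split_by_optimality(hns)

print("all tasks:    ", summarize(hns))
print("agent-optimal:", summarize(agent_optimal))
print("human-optimal:", summarize(human_optimal))
```

In this toy data the overall mean exceeds 1.0 even though half of the tasks fall well below human level, which is the kind of masking that reporting both sets side by side is meant to expose.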
