Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning
Nilaksh, Antoine Clavaud, Mathieu Reymond, François Rivest, Sarath Chandar
TL;DR
This work tackles sample inefficiency in streaming reinforcement learning, where each transition is used once and then discarded. It adapts Self-Predictive Representations (SPR) as an auxiliary objective and develops orthogonal gradient updates to keep that objective from interfering with the streaming RL optimization, including reconciliation with the ObGD optimizer. Empirically, SPR substantially improves performance on Atari, MinAtar, and Octax and yields higher-rank latent representations, as shown by effective-rank measurements and t-SNE visualizations. The method remains computationally efficient enough for CPU-based on-device learning, and demonstrates that dense auxiliary supervision can bridge much of the gap created by the absence of a replay buffer. Overall, SPR with orthogonal updates provides a practical path to more sample-efficient, on-device streaming RL.
Abstract
In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
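The abstract does not spell out the exact form of the orthogonal gradient update. One common way to make a gradient orthogonal to a reference direction is to subtract its projection onto that direction, and the sketch below illustrates only that generic operation. It assumes gradients are flattened into plain lists of floats; the function name `project_orthogonal`, the `eps` threshold, and the choice of reference vector are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def project_orthogonal(
    aux_grad: Sequence[float],
    ref: Sequence[float],
    eps: float = 1e-12,  # assumed threshold for treating ref as the zero vector
) -> List[float]:
    """Return aux_grad with its component along ref removed.

    Generic vector projection, not the paper's specific update rule:
        g_perp = g - (<g, r> / <r, r>) * r
    """
    if len(aux_grad) != len(ref):
        raise ValueError("aux_grad and ref must have the same length")

    ref_sq_norm = sum(r * r for r in ref)
    if ref_sq_norm < eps:
        # No meaningful reference direction: leave the gradient unchanged.
        return list(aux_grad)

    scale = sum(a * r for a, r in zip(aux_grad, ref)) / ref_sq_norm
    return [a - scale * r for a, r in zip(aux_grad, ref)]


# Toy example: remove the first-axis component from an auxiliary gradient.
aux = [1.0, 2.0, 0.0]
ref = [1.0, 0.0, 0.0]
projected = project_orthogonal(aux, ref)
```

After projection, the inner product between `projected` and `ref` is zero up to floating-point error, so a step along `projected` does not move the parameters along the reference direction to first order.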
