Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning

Nilaksh, Antoine Clavaud, Mathieu Reymond, François Rivest, Sarath Chandar

TL;DR

This work tackles sample inefficiency in streaming reinforcement learning, where each transition is used for a single update and then discarded. It extends Self-Predictive Representations (SPR) to the streaming setting as an auxiliary objective and introduces orthogonal gradient updates that stabilize training, including a further orthogonalization step that reconciles the SPR gradients with the ObGD optimizer. Empirically, adding SPR improves on existing streaming baselines across Atari, MinAtar, and Octax and yields latent representations with higher effective rank, as shown by latent-space analyses and t-SNE visualizations. The method is efficient enough to train on a few CPU cores, which suits on-device learning, and the results suggest that dense auxiliary supervision can recover much of the performance lost by dropping the replay buffer.

Abstract

In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
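
The abstract does not spell out how the orthogonal update is computed. As a rough illustration only, one generic way to make an auxiliary gradient orthogonal to a reference direction is to subtract its projection onto that direction. The sketch below assumes flattened gradient vectors; the names `project_orthogonal`, `g_spr`, and `g_ref` are hypothetical and are not taken from the paper, and the paper's actual procedure may differ.

```python
import numpy as np

def project_orthogonal(g_aux, g_ref, eps=1e-12):
    """Return g_aux with its component along g_ref removed.

    Generic vector projection, shown as an assumption about what an
    "orthogonal gradient update" could look like; not the authors' code.
    """
    g_aux = np.asarray(g_aux, dtype=np.float64)
    g_ref = np.asarray(g_ref, dtype=np.float64)
    ref_sq_norm = float(np.dot(g_ref, g_ref))
    if ref_sq_norm < eps:
        # Reference direction is (numerically) zero: nothing to remove.
        return g_aux.copy()
    coeff = np.dot(g_aux, g_ref) / ref_sq_norm
    return g_aux - coeff * g_ref

# Toy example with hand-picked vectors (hypothetical values).
g_spr = np.array([1.0, 2.0, 0.5])   # stand-in for an auxiliary-loss gradient
g_ref = np.array([0.0, 1.0, 0.0])   # stand-in for the reference direction
g_spr_orth = project_orthogonal(g_spr, g_ref)
```

After the projection, the dot product between `g_spr_orth` and `g_ref` is zero, so a step along `g_spr_orth` does not move the parameters along the reference direction to first order.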

Paper Structure

This paper contains 43 sections, 12 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: Streaming RL setups so far (TOP) have relied only on the Q-learning loss to learn task-specific representations. We instead propose an additional representation-learning loss (BOTTOM) to help the encoder learn better representations. t-SNE visualizations of the latents of a sample trajectory support this: our training method produces a smooth latent trajectory, in which temporally closer samples are also closer in the t-SNE plot, as shown by the color gradient. Takeaway: A representation-learning loss makes better use of the streaming data than the Q-learning loss alone and improves policy performance.
  • Figure 2: t-SNE \cite{maaten2008visualizing} visualizations (perplexity 30) of the encoder latents computed on a random 5000-step trajectory in the Atari Alien environment with the QRC($\lambda$) \cite{elelimy2025deep} agent. The columns show how the encoder latents evolve over training, from random initialization (first column) to 40M frames (last column). The color gradient shows the temporal relation between the samples. The SPR latents (TOP) display distinct clusters made up of several continuous segments and evolve much faster than the non-SPR latents (BOTTOM), which either lack these structures or develop them late in training. Figure \ref{fig:tsne-traj-many} in Appendix \ref{appendix:latent} shows visualizations for other environments and perplexities.
  • Figure 3: Learning curves of the different algorithms on Atari (TOP), MinAtar (MIDDLE) and Octax (BOTTOM). SPR leads to much better sample efficiency for QRC($\lambda$) on all environments, and for streaming DQN on most Atari environments. Stream Q($\lambda$) without modifications turns out to be incompatible with SPR; however, adding another orthogonalization step, orth$^2$ (Sec. \ref{sec:reconciling-grads}), improves SPR performance and yields higher returns (Table \ref{tab:aggregate_results}). See Sec. \ref{sec:algo_desc} for a description of the legend names. Figures \ref{fig:learning-curves-octax} and \ref{fig:learning-curves-atari-100M} in Appendix \ref{appendix:additional-curves} show training curves on additional environments.
  • Figure 4: Cosine similarity between the gradients back-propagated from the Q network and from the SPR networks, averaged over 5 Atari environments. For QRC($\lambda$), both come from the SGD optimizer, whereas Stream Q($\lambda$) uses ObGD for the Q network and SGD for the SPR networks. This mismatch leads to conflicting gradient updates, indicated by the negative cosine similarity, which may explain the poor SPR performance with Stream Q($\lambda$). In contrast, for QRC($\lambda$), the gradients are almost orthogonal.
  • Figure 5: Aggregate metrics with 95% confidence intervals for the algorithms across domains: Atari (TOP), MinAtar (MIDDLE) and Octax (BOTTOM). IQM stands for interquartile mean. Each experiment was run with five random seeds. Adding SPR increases aggregate performance for both streaming DQN and QRC($\lambda$) across all metrics, and orthogonal updates further help SPR in the streaming setting. SPR fails with Stream Q($\lambda$) because of gradient conflicts. However, after orthogonalizing the SPR gradients with respect to the ObGD gradients as described in Sec. \ref{sec:reconciling-grads}, SPR performance improves markedly, with a higher median and IQM than Stream Q($\lambda$).
  • ...and 4 more figures
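
The captions of Figures 4 and 5 rely on two simple quantities: the cosine similarity between two gradients, and the interquartile mean (IQM) of a set of scores. The sketch below shows how each could be computed with NumPy. It is an illustrative assumption rather than the authors' evaluation code: the function names are hypothetical, gradients are assumed to be flattened into vectors, and the IQM is approximated by a plain trimmed mean that drops the lowest and highest 25% of scores (rounded down), without the confidence intervals reported in the paper.

```python
import numpy as np

def grad_cosine(g_a, g_b, eps=1e-12):
    """Cosine similarity between two gradients, flattened to vectors.

    Negative values indicate conflicting directions; values near zero
    indicate nearly orthogonal gradients.
    """
    g_a = np.ravel(np.asarray(g_a, dtype=np.float64))
    g_b = np.ravel(np.asarray(g_b, dtype=np.float64))
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    if denom < eps:
        # At least one gradient is (numerically) zero; report no alignment.
        return 0.0
    return float(np.dot(g_a, g_b) / denom)

def iqm(scores):
    """Simplified interquartile mean: mean of the middle 50% of scores.

    Drops int(0.25 * n) values from each end of the sorted scores, so it
    is exact only when n is divisible by 4 (assumption of this sketch).
    """
    x = np.sort(np.ravel(np.asarray(scores, dtype=np.float64)))
    k = int(0.25 * x.size)
    return float(np.mean(x[k:x.size - k]))

# Toy values (hypothetical, not results from the paper).
cos_conflict = grad_cosine([1.0, 2.0], [-1.0, -2.0])   # opposite directions
cos_orthogonal = grad_cosine([1.0, 0.0], [0.0, 1.0])   # orthogonal directions
robust_score = iqm([1, 2, 3, 4, 5, 6, 7, 100])         # outlier 100 is trimmed
```

Because the IQM discards the extreme quartiles, a single outlier run such as the `100` above does not dominate the aggregate, which is why it is often preferred over the mean for comparing agents across a small number of seeds.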