Fostering Intrinsic Motivation in Reinforcement Learning with Pretrained Foundation Models

Alain Andres, Javier Del Ser

TL;DR

This work investigates whether providing the intrinsic module with complete state information -- rather than just partial observations -- can improve exploration, despite the difficulties in handling small variations within large state spaces, and shows that intrinsic modules can effectively utilize full state information.

Abstract

Exploration remains a significant challenge in reinforcement learning, especially in environments where extrinsic rewards are sparse or non-existent. The recent rise of foundation models, such as CLIP, offers an opportunity to leverage pretrained, semantically rich embeddings that encapsulate broad and reusable knowledge. In this work we explore the potential of these foundation models to drive exploration, and we analyze the critical role of the episodic novelty term in making the agent's exploration effective. We also investigate whether providing the intrinsic module with complete state information -- rather than just partial observations -- can improve exploration, despite the difficulties in handling small variations within large state spaces. Our experiments in the MiniGrid domain reveal that intrinsic modules can effectively utilize full state information, significantly increasing sample efficiency while learning an optimal policy. Moreover, we show that the embeddings provided by foundation models are sometimes even better than those constructed by the agent during training, further accelerating the learning process, especially when coupled with the episodic novelty term.
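To make the "episodic novelty term" concrete, the following is a minimal sketch that assumes the count-based form common in RIDE-style methods, where an intrinsic bonus is scaled by $1/\sqrt{N_{ep}(s)}$, with $N_{ep}(s)$ the number of visits to state $s$ within the current episode. The excerpt does not give the paper's exact formulation, so the names `EpisodicNovelty`, `scale`, and `intrinsic_reward` are hypothetical and used only for illustration.

```python
import math
from collections import defaultdict


class EpisodicNovelty:
    """Per-episode visit counter (hypothetical helper, not the paper's code)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reset(self):
        # Call at the start of every episode so counts do not carry over.
        self.counts.clear()

    def scale(self, state_key):
        # Register the visit, then return 1 / sqrt(N_ep(s)).
        self.counts[state_key] += 1
        return 1.0 / math.sqrt(self.counts[state_key])


def intrinsic_reward(base_bonus, novelty, state_key):
    # Assumed combination: any base bonus (e.g. a change in embedding space)
    # is discounted when the state was already visited in this episode.
    return base_bonus * novelty.scale(state_key)
```

Under this assumption, revisiting the same state within an episode yields a shrinking bonus, which discourages the agent from oscillating between a few states to collect intrinsic reward. The state key must be hashable, for instance a tuple or the bytes of an encoded grid.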

Paper Structure

This paper contains 20 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustration of full state and partial observation formats in the MiniGrid environment, with two types of representations: encoded and RGB. The top row shows the full state $s_t^{enc}$, where the agent has access to the entire grid layout, represented as a three-channel encoded matrix. The three channels encode object type, color, and state (e.g., open or closed doors, agent orientation). The dotted pink square in this row indicates the region that would compose the agent’s observation in a partial view. The middle row displays this partial observation $o_t^{enc}$, where the agent’s perception is limited to a $7\times7$ egocentric field centered on its position. The green square in this row highlights the area visible to the agent, showing how walls and other objects limit the agent's field of vision. Lastly, the bottom row shows the RGB representation of a partial observation $o_t^{rgb}$, where each cell is represented according to three color channels (red, green, and blue) for a pixel-based view.
  • Figure 2: Performance comparison between RIDE (orange) and FoMoRL (blue) across various MiniGrid environments and input types. Each subfigure presents the average return over training steps, illustrating learning progress and convergence speed for each algorithm in specific settings. Solid lines represent full-state inputs ($s_t$), while dotted lines denote partial observations ($o_t$). The horizontal black dashed line indicates the expected return of the optimal policy. The shaded area represents the standard deviation computed across 3 different seeds.
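The relationship between the full state and the partial observation described in Figure 1 can be sketched as a crop of the three-channel encoded grid. This is a simplified illustration and not the environment's own implementation: it assumes an array of shape `(H, W, 3)` holding object type, color, and state, takes a window centered on the agent as the caption describes, and pads out-of-bounds cells with zeros. It ignores agent orientation and occlusion by walls. The function name `egocentric_crop` and its parameters are hypothetical.

```python
import numpy as np


def egocentric_crop(full_state, agent_pos, view_size=7, pad_value=0):
    """Return a (view_size, view_size, 3) window centered on the agent.

    full_state: array of shape (H, W, 3) with object type, color, and state.
    agent_pos:  (row, col) of the agent in the full grid.
    Assumption: no rotation by agent orientation and no wall occlusion.
    """
    half = view_size // 2
    # Pad the two spatial axes so windows near the border stay in bounds.
    padded = np.pad(
        full_state,
        ((half, half), (half, half), (0, 0)),
        mode="constant",
        constant_values=pad_value,
    )
    row, col = agent_pos
    # After padding, the agent sits at (row + half, col + half), so the
    # window starting at (row, col) is centered on it.
    return padded[row:row + view_size, col:col + view_size, :]
```

For a $10\times10$ grid, `egocentric_crop(grid, (5, 5))` returns a $7\times7\times3$ array whose central cell equals `grid[5, 5]`. An intrinsic module fed the full state sees all $H \times W$ cells, whereas one fed this crop sees only 49 of them.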