Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration

Amirhossein Roknilamouki, Arnob Ghosh, Eylem Ekici, Ness B. Shroff

Abstract

While offline reinforcement learning provides reliable policies for real-world deployment, its inherent pessimism severely restricts an agent's ability to explore and collect novel data online. As in safe reinforcement learning, exploring near the boundary of regions that are well covered by the offline dataset and reliably modeled by the simulator allows an agent to take manageable risks: it can venture into informative but moderate-uncertainty states while remaining close enough to familiar regions for safe recovery. However, naively rewarding this boundary-seeking behavior can lead to a degenerate parking behavior, in which the agent simply stops once it reaches the frontier. To solve this, we propose a novel vector-field reward shaping paradigm designed to induce continuous, safe boundary exploration for non-adaptive deployed policies. Operating on an uncertainty oracle trained from offline data, our reward combines two complementary components: a gradient-alignment term that attracts the agent toward a target uncertainty level, and a rotational-flow term that promotes motion along the local tangent plane of the uncertainty manifold. Through theoretical analysis, we show that this reward structure naturally induces sustained exploratory behavior along the boundary while preventing degenerate solutions. Empirically, by integrating our proposed reward shaping with Soft Actor-Critic on a 2D continuous navigation task, we validate that agents successfully traverse uncertainty boundaries while balancing safe, informative data collection with primary task completion.

Paper Structure (15 sections, 3 theorems, 19 equations, 5 figures, 1 table)

Key Result

Lemma 1

A reward function consisting solely of a gradient field yields zero net exploratory return for any trajectory confined to the target manifold $\mathcal{U}$, or for any closed loop.
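
Lemma 1 can be checked numerically. The sketch below assumes that the return along a path is the discretized line integral of a vector field (the per-step reward being the inner product of the field with the displacement); this summary does not state the reward in that form, so treat it as an illustration only. The uncertainty function, the paths, and every name in the code are hypothetical.

```python
import numpy as np

def grad_U(p):
    """Gradient of a hypothetical uncertainty U(x, y) = exp(-(x^2 + y^2))."""
    x, y = p[..., 0], p[..., 1]
    u = np.exp(-(x**2 + y**2))
    return np.stack([-2 * x * u, -2 * y * u], axis=-1)

def rot_field(p):
    """Gradient rotated by 90 degrees, hence tangent to the level sets of U."""
    g = grad_U(p)
    return np.stack([-g[..., 1], g[..., 0]], axis=-1)

def path_return(field, path):
    """Sum of field(midpoint) . displacement over the segments of a path."""
    mid = 0.5 * (path[1:] + path[:-1])
    step = path[1:] - path[:-1]
    return float(np.sum(field(mid) * step))

# Closed loop: an off-centre ellipse, so it is not itself a level set of U.
t = np.linspace(0.0, 2.0 * np.pi, 4001)
loop = np.stack([0.3 + np.cos(t), 0.5 * np.sin(t)], axis=-1)

# Path confined to a level set of U: half of the circle of radius 0.8.
s = np.linspace(0.0, np.pi, 2001)
arc = 0.8 * np.stack([np.cos(s), np.sin(s)], axis=-1)

grad_loop = path_return(grad_U, loop)   # gradient field, closed loop
grad_arc = path_return(grad_U, arc)     # gradient field, path on the level set
rot_arc = path_return(rot_field, arc)   # rotated field, same path

print(grad_loop, grad_arc, rot_arc)
```

Under these assumptions the two gradient-field returns vanish up to discretization error, while the rotated field accumulates a return of order one along the same arc; its sign depends only on the chosen direction of rotation.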

Figures (5)

  • Figure 1: Toy navigation with localized uncertainty. (a) Example trajectories from the same start/goal. A pessimistic baseline (blue, dashed) takes a conservative detour to avoid the high-uncertainty region. Our approach drives the agent to the target manifold $\mathcal{U}$ (green), rotates along the boundary to collect informative samples, and then reaches the goal without entering the uncertain interior (shaded). (b) Visualization of the induced vector field for controlled exploration. The blue arrows direct the agent toward the high-uncertainty regions, seeking the target uncertainty boundary $U_{\mathrm{mid}}$. Upon reaching this target level, the orange arrows (Curl Boundary Band) induce a tangential flow, ensuring the agent continuously moves along and explores the surface of the manifold.
  • Figure 2: Our vector-field reward design results in periodic orbit behavior: the agent rotates around the uncertainty region to collect as much data as it can.
  • Figure 3: State-based intrinsic reward baseline demonstrating mode collapse. Without the rotational flow component, the agent's state visitations (blue points) collapse into a highly localized cluster on the target manifold $\mathcal{U}$ rather than continuously exploring the frontier.
  • Figure 4: Balancing target manifold coverage with a primary navigation task. (a) Our method induces a time-splitting strategy, actively circulating the manifold before navigating to the goal. (b) The baseline intrinsic reward fails to achieve diverse boundary coverage, moving almost directly to the goal.
  • Figure 5: For higher dimensions ($d > 2$), our reward function in Eq. \ref{eq:rewardDesign} generates different tangential vector fields for different choices of the skew-symmetric matrix $W$. Because rotational flow is not unique in these spaces, each choice of $W$ dictates a distinct rotational trajectory along the boundary $\mathcal{U}$.
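
The non-uniqueness described in the Figure 5 caption rests on a basic linear-algebra fact: for any skew-symmetric $W$ and any vector $g$, $g^\top W g = 0$, so $Wg$ is orthogonal to $g$. The sketch below assumes the tangential direction has the form $W \nabla U$; the referenced equation is not reproduced in this summary, so that form, the stand-in gradient `g`, and the two matrices are illustrative assumptions.

```python
import numpy as np

# Stand-in for the uncertainty gradient at some state in d = 3 (hypothetical values).
g = np.array([1.0, 2.0, 3.0])

# Two different skew-symmetric matrices (each satisfies W.T == -W).
W1 = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 0.0]])
W2 = np.array([[0.0, 0.0, -1.0],
               [0.0, 0.0,  0.0],
               [1.0, 0.0,  0.0]])

v1 = W1 @ g   # candidate tangential direction from W1
v2 = W2 @ g   # candidate tangential direction from W2

# Both directions are orthogonal to the gradient, i.e. tangent to the level set.
dot1 = float(g @ v1)
dot2 = float(g @ v2)

# The two directions are not parallel, so they describe different flows.
cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(dot1, dot2, cosine)
```

In two dimensions every skew-symmetric matrix is a scalar multiple of a single 90-degree rotation, so the tangential direction is fixed up to sign and scale; from $d = 3$ onward the choice of $W$ selects among many tangent directions, as in this example.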

Theorems & Definitions (8)

  • Example 1: Navigation with Localized Uncertainty
  • Lemma 1: Insufficiency of Gradient Fields for Boundary Exploration
  • Theorem 1: Manifold-seeking orthogonal reward shaping
  • Remark 1: Behavioral Trade-offs in Near-Manifold Concentration
  • proof
  • Corollary 1: No-sticking on the manifold (in expectation)
  • proof
  • Remark 2: Challenges of State Marginal Matching for Stationary Deployment