Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation

Bhrij Patel; Kasun Weerakoon; Wesley A. Suttle; Alec Koppel; Brian M. Sadler; Tianyi Zhou; Amrit Singh Bedi; Dinesh Manocha

TL;DR

This work tackles sparse-reward robotic navigation by introducing Confidence-Controlled Exploration (CCE), a method that adapts trajectory length based on policy entropy to balance exploration and exploitation without modifying the reward function. By using $H(\pi_{\theta})$ as a proxy for mixing time, CCE computes a trajectory length $K_c$ with the mapping $K_c = \text{round}(\alpha - \frac{H_c}{H_0}(\alpha - K_0))$, and applies this during training across on-policy and off-policy RL methods. The approach yields substantial gains in sample efficiency and navigation quality, achieving about 18% higher success rates, 20–38% shorter paths, and 9.32% lower elevation costs under a fixed training budget in both simulated outdoor environments and real-world Husky experiments. Notably, CCE transfers to real hardware with minimal performance degradation, though limitations arise in bottleneck-rich environments where the entropy–mixing-time relation weakens. The results suggest that uncertainty-guided trajectory-length adjustment is a practical, general blueprint for improving sparse-reward RL in robotics without reward shaping.
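The trajectory-length mapping quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that $H_c$ is the current policy entropy, $H_0$ the initial policy entropy, $K_0$ the initial trajectory length, and $\alpha$ a tunable constant; these readings of the symbols, the function name, and the numeric values are assumptions made for the example.

```python
def cce_trajectory_length(h_c: float, h_0: float, k_0: int, alpha: float) -> int:
    """Sketch of K_c = round(alpha - (H_c / H_0) * (alpha - K_0)).

    Assumed meanings (not confirmed by this summary):
      h_c   -- current policy entropy
      h_0   -- initial policy entropy (must be positive here)
      k_0   -- initial trajectory length
      alpha -- tunable constant; K_c approaches alpha as h_c approaches 0
    """
    if h_0 <= 0:
        raise ValueError("initial entropy h_0 must be positive")
    # Python's built-in round() rounds exact halves to the nearest even integer.
    return round(alpha - (h_c / h_0) * (alpha - k_0))


# Hypothetical values, chosen only to show the behaviour of the mapping.
initial_length = cce_trajectory_length(h_c=2.0, h_0=2.0, k_0=100, alpha=500)
halfway_length = cce_trajectory_length(h_c=1.0, h_0=2.0, k_0=100, alpha=500)
confident_length = cce_trajectory_length(h_c=0.0, h_0=2.0, k_0=100, alpha=500)
```

With $\alpha > K_0$, as in these hypothetical values, the mapping returns $K_0$ while the entropy is still at its initial value and grows linearly toward $\alpha$ as the entropy falls, which matches the described behaviour of short trajectories under high uncertainty and longer ones under high confidence.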

Abstract

Reinforcement learning (RL) is a promising approach for robotic navigation, allowing robots to learn through trial and error. However, real-world robotic tasks often have sparse rewards, which, combined with the sample inefficiency of RL, lead to inefficient exploration and suboptimal policies. In this work, we introduce Confidence-Controlled Exploration (CCE), a novel method that improves sample efficiency in RL-based robotic navigation without modifying the reward function. Unlike existing approaches, such as entropy regularization and reward shaping, which can introduce instability by altering rewards, CCE dynamically adjusts trajectory length based on policy entropy. Specifically, it shortens trajectories when uncertainty is high to enhance exploration and extends them when confidence is high to prioritize exploitation. CCE is a principled and practical solution inspired by a theoretical connection between policy entropy and gradient estimation. It integrates seamlessly with on-policy and off-policy RL methods and requires minimal modifications. We validate CCE across REINFORCE, PPO, and SAC in both simulated and real-world navigation tasks. CCE outperforms fixed-trajectory and entropy-regularized baselines, achieving an 18% higher success rate, 20–38% shorter paths, and 9.32% lower elevation costs under a fixed training sample budget. Finally, we deploy CCE on a Clearpath Husky robot, demonstrating its effectiveness in complex outdoor environments.
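Because CCE is driven by policy entropy, it may help to see how that quantity is commonly computed. The sketch below uses the standard closed forms for a categorical action distribution and a one-dimensional Gaussian. Which policy class and entropy estimator the paper actually uses is not stated in this summary, so the function names and example values are illustrative assumptions.

```python
import math
from typing import Sequence


def categorical_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a discrete action distribution.

    Zero-probability actions contribute nothing (the limit of p * log p is 0).
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gaussian_entropy(sigma: float) -> float:
    """Differential entropy in nats of a 1-D Gaussian with standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


# Hypothetical policies: a uniform (uncertain) one and a peaked (confident) one.
uncertain = categorical_entropy([0.25, 0.25, 0.25, 0.25])
confident = categorical_entropy([0.97, 0.01, 0.01, 0.01])
wide = gaussian_entropy(1.0)
narrow = gaussian_entropy(0.1)
```

A uniform policy attains the maximum entropy $\log |A|$ for an action set $A$, and entropy falls as the policy concentrates on fewer actions. Differential entropy, unlike its discrete counterpart, can be negative for small `sigma`, so an entropy ratio built from it would need care; how the paper treats this case is not covered here.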

Paper Structure (13 sections, 4 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Related Works
Problem Formulation
Proposed Approach: Confidence Controlled Exploration-based Policy Learning
Policy Entropy to Control Navigation Exploration
CCE: Confidence-Controlled Exploration
Experiments and Results
Evaluation Metrics
Results and Discussions
Performance in Environments with Bottlenecks
Conclusions, Limitations, and Future Works
Appendix
Experimental Setup Details

Figures (6)

Figure 1: The left figure shows sample trajectories of a robotic agent in a real-world test environment. The agent utilizes navigation policies trained with a policy gradient algorithm (REINFORCE) using four trajectory length schemes. We compare three baseline trajectory length schemes against our CCE method (orange). CCE results in policies that generate shorter and more successful test-time trajectories while saving a larger percentage of the training sample budget, as indicated in the right figure. More details on the baselines, the policy gradient algorithm, and the calculation of the percentage of sample budget saved can be found in Sec. \ref{sec:result}.
Figure 2: Comparison of four different trajectory-length schemes on [LEFT] even and [RIGHT] uneven terrain navigation tasks in a Unity-based outdoor robot simulator. For REINFORCE, PPO, and SAC, using CCE as the trajectory length scheme for $150$ episodes leads to a policy with a shorter path length than the baselines under limited samples. The drawn trajectories are sample representations of the robot’s collected odometry data. The trajectories pictured are representative of all $3$ algorithms, since we found that each trajectory-length scheme leads to similar navigation regardless of the RL algorithm. See Figures \ref{fig:even_training} and \ref{fig:uneven_training} for training details.
Figure 3: [TOP ROW] Learning curves during training and [BOTTOM] percent of sample budget ($\sim 6.0e4$) saved during navigation tasks for the even terrain (left of Figure \ref{fig:terrain_nav}). Comparison of [LEFT COLUMN] REINFORCE, [MIDDLE] PPO, and [RIGHT] SAC, with constant and adaptive trajectory lengths. For each algorithm, CCE converges to a higher cumulative reward while expending less of the total budget. We ran $8$ to $10$ independent replications for each of the $12$ experiments. Although a fixed trajectory length of $200$ saves a share of the sample budget comparable to CCE, it achieves lower reward in all cases because the robot flips over, which ends the episode.
Figure 4: Sample navigation trajectories generated in [TOP ROW] simulated and [BOTTOM] real-world environments by policies trained with REINFORCE using CCE vs. fixed trajectory lengths at a fixed sample budget of $\sim 4.5e4$. We observed that the fixed trajectory length policies often fail at test time due to inefficient use of the available sample budget during training. In contrast, CCE policies successfully complete the tasks after training with the same limited sample budget. The experiments with a real Clearpath Husky robot validate that CCE can be transferred onto real robotic systems without significant performance degradation. The drawn trajectories are sample representations of the robot's odometry data.
Figure 5: [TOP ROW] Learning curves during training and [BOTTOM] percent of sample budget ($\sim 6.0e4$) saved during navigation tasks for the uneven terrain (right of Figure \ref{fig:terrain_nav}). Comparison of [LEFT COLUMN] REINFORCE, [MIDDLE] PPO, and [RIGHT] SAC, with constant and adaptive trajectory lengths. For each algorithm, CCE converges to a higher cumulative reward while expending less of the total budget. We ran $8$ to $10$ independent replications for each of the $12$ experiments. Although a fixed trajectory length of $200$ saves a share of the sample budget comparable to CCE, it achieves lower reward in all cases because the robot flips over, which ends the episode.
...and 1 more figure
