Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation
Bhrij Patel, Kasun Weerakoon, Wesley A. Suttle, Alec Koppel, Brian M. Sadler, Tianyi Zhou, Amrit Singh Bedi, Dinesh Manocha
TL;DR
This work tackles sparse-reward robotic navigation by introducing Confidence-Controlled Exploration (CCE), a method that adapts trajectory length based on policy entropy to balance exploration and exploitation without modifying the reward function. Using the policy entropy $H(\pi_{\theta})$ as a proxy for mixing time, CCE computes a trajectory length $K_c$ with the mapping $K_c = \text{round}(\alpha - \frac{H_c}{H_0}(\alpha - K_0))$ and applies it during training across on-policy and off-policy RL methods. The approach yields substantial gains in sample efficiency and navigation quality, achieving about 18% higher success rates, 20–38% shorter paths, and 9.32% lower elevation costs under a fixed training budget in both simulated outdoor environments and real-world Husky experiments. Notably, CCE transfers to real hardware with minimal performance degradation, though limitations arise in bottleneck-rich environments where the entropy–mixing-time relation weakens. The results suggest that uncertainty-guided trajectory-length adjustment is a practical, general blueprint for improving sparse-reward RL in robotics without reward shaping.
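To make the mapping concrete, here is a minimal Python sketch of the formula above. The symbol meanings are an interpretation rather than something this summary states: $H_c$ is read as the current policy entropy, $H_0$ as the entropy at the start of training, $K_0$ as the initial trajectory length, and $\alpha$ as the length reached when entropy falls to zero. The function name `trajectory_length` and the numeric values are hypothetical and chosen only for illustration.

```python
def trajectory_length(h_current: float, h_initial: float,
                      k_initial: int, alpha: int) -> int:
    """Evaluate K_c = round(alpha - (H_c / H_0) * (alpha - K_0)).

    Assumed meanings (not confirmed by the summary):
      h_current -- current policy entropy H_c
      h_initial -- policy entropy at the start of training H_0 (must be > 0)
      k_initial -- initial trajectory length K_0
      alpha     -- trajectory length reached when the entropy is zero
    """
    if h_initial <= 0:
        raise ValueError("h_initial must be positive")
    ratio = h_current / h_initial
    # No clamping is applied here: if the entropy rises above its initial
    # value, the result drops below k_initial.
    return round(alpha - ratio * (alpha - k_initial))


# Hypothetical settings: K_0 = 50 steps, alpha = 200 steps.
print(trajectory_length(1.0, 1.0, k_initial=50, alpha=200))  # entropy unchanged
print(trajectory_length(0.5, 1.0, k_initial=50, alpha=200))  # entropy halved
print(trajectory_length(0.0, 1.0, k_initial=50, alpha=200))  # fully confident
```

Under this reading, the length starts at $K_0$ while the entropy equals $H_0$ and grows linearly toward $\alpha$ as the entropy decays, which matches the stated behavior of short trajectories under high uncertainty and long ones under high confidence.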
Abstract
Reinforcement learning (RL) is a promising approach for robotic navigation, allowing robots to learn through trial and error. However, real-world robotic tasks often suffer from sparse rewards, leading to inefficient exploration and suboptimal policies due to the sample inefficiency of RL. In this work, we introduce Confidence-Controlled Exploration (CCE), a novel method that improves sample efficiency in RL-based robotic navigation without modifying the reward function. Unlike existing approaches, such as entropy regularization and reward shaping, which can introduce instability by altering rewards, CCE dynamically adjusts trajectory length based on policy entropy. Specifically, it shortens trajectories when uncertainty is high to enhance exploration and extends them when confidence is high to prioritize exploitation. CCE is a principled and practical solution inspired by a theoretical connection between policy entropy and gradient estimation. It integrates seamlessly with on-policy and off-policy RL methods and requires minimal modifications. We validate CCE across REINFORCE, PPO, and SAC in both simulated and real-world navigation tasks. CCE outperforms fixed-trajectory and entropy-regularized baselines, achieving an 18% higher success rate, 20–38% shorter paths, and 9.32% lower elevation costs under a fixed training sample budget. Finally, we deploy CCE on a Clearpath Husky robot, demonstrating its effectiveness in complex outdoor environments.
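As an illustration of the confidence signal the abstract refers to, the following Python sketch computes the Shannon entropy of a policy's action distribution. It assumes a discrete action space, which is a simplification made here for clarity; the abstract does not specify the action space, and continuous policies would use differential entropy instead. The function name `policy_entropy` and the example distributions are hypothetical.

```python
import math


def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution.

    Zero-probability actions contribute nothing, following the
    convention 0 * log(0) = 0.
    """
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


# Hypothetical four-action policies at two stages of training.
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform: maximal entropy, log(4)
confident = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic: low entropy

print(policy_entropy(uncertain))
print(policy_entropy(confident))
```

A uniform policy gives the highest entropy and a nearly deterministic one gives a value close to zero, so a falling entropy can serve as the cue for moving from short, exploratory trajectories to longer, exploitative ones.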
