Parallel Heuristic Search as Inference for Actor-Critic Reinforcement Learning Models
Hanlan Yang, Itamar Mishani, Luca Pivetti, Zachary Kingston, Maxim Likhachev
TL;DR
This work addresses a deployment bottleneck for RL in robotics: policies are usually executed by direct actor rollouts, with no multi-step lookahead at inference time. It introduces PACHS, a parallel best-first search that uses the actor $\pi_{\theta}$ to generate candidate actions and the critic $Q_{\phi}(s,a)$ as a learned cost-to-go heuristic within a parallel edge-expansion framework. Efficiency comes from two levels of parallelism: batched actor and critic inference, and graph expansion distributed across multiple threads. Empirical results on Panda-Shelf motion planning and PushT tasks show high success rates and zero-shot generalization to obstacle-rich settings without retraining, with competitive planning times and fewer edge evaluations. The reduction in edge evaluations is attributed to critic-guided edge priorities of the form $f(e)=g(s)+w\cdot Q_{\phi}(s,a)$. Real-robot deployment and theoretical guarantees are identified as promising future directions.
Abstract
Actor-Critic models are a class of model-free deep reinforcement learning (RL) algorithms that have demonstrated effectiveness across various robot learning tasks. While considerable research has focused on improving training stability and data sampling efficiency, most deployment strategies have remained relatively simplistic, typically relying on direct actor policy rollouts. In contrast, we propose PACHS (Parallel Actor-Critic Heuristic Search), an efficient parallel best-first search algorithm for inference that leverages both components of the actor-critic architecture: the actor network generates actions, while the critic network provides cost-to-go estimates to guide the search. Two levels of parallelism are employed within the search: actions and cost-to-go estimates are generated in batches by the actor and critic networks respectively, and graph expansion is distributed across multiple threads. We demonstrate the effectiveness of our approach in robotic manipulation tasks, including collision-free motion planning and contact-rich interactions such as non-prehensile pushing. Visit p-achs.github.io for demonstrations and examples.
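The search loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It is a toy under stated assumptions: a 2-D grid stands in for the robot's state space, `actor` and `critic` are hypothetical stand-ins for trained networks (the critic simply returns a Manhattan distance as its cost-to-go estimate), every edge has unit cost, and `evaluate_edge` stands in for whatever expensive check a real planner performs. Edges are ordered by $f(e)=g(s)+w\cdot Q_{\phi}(s,a)$, with $g(s)$ assumed here to be the cost accumulated to reach $s$. Actor and critic calls are batched over states, and popped edges are evaluated on a thread pool.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Hypothetical candidate actions; a trained actor would sample these instead.
MOVES = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])


def actor(states):
    """Toy actor: propose the same 4 unit moves for each state. Shape (B, 4, 2)."""
    return np.broadcast_to(MOVES, (len(states), *MOVES.shape))


def critic(states, actions, goal):
    """Toy critic: cost-to-go Q(s, a) = Manhattan distance from s + a to the goal."""
    return np.abs(states + actions - goal).sum(axis=-1)


def evaluate_edge(state, action, obstacles, size):
    """Stand-in for the expensive step (e.g. simulation or collision checking)."""
    nxt = (state[0] + action[0], state[1] + action[1])
    valid = 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in obstacles
    return nxt if valid else None


def search(start, goal, obstacles, size=6, w=2.0, batch=4, workers=4):
    goal_arr = np.array(goal)
    g = {start: 0.0}            # best known cost-to-come per state
    parent = {start: None}
    open_edges = []             # heap entries: (f, tie_breaker, state, action)
    tie = 0

    def push_edges(states):
        """One batched actor call and one batched critic call for all states."""
        nonlocal tie
        s = np.array(states)                           # (B, 2)
        acts = actor(s)                                # (B, K, 2)
        q = critic(s[:, None, :], acts, goal_arr)      # (B, K)
        for i, st in enumerate(states):
            for a, qa in zip(acts[i], q[i]):
                f = g[st] + w * float(qa)              # f(e) = g(s) + w * Q(s, a)
                heapq.heappush(open_edges, (f, tie, st, (int(a[0]), int(a[1]))))
                tie += 1

    push_edges([start])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while open_edges:
            # Pop the most promising edges and evaluate them concurrently.
            n = min(batch, len(open_edges))
            popped = [heapq.heappop(open_edges) for _ in range(n)]
            results = list(pool.map(
                lambda e: evaluate_edge(e[2], e[3], obstacles, size), popped))
            improved = {}
            for (_, _, st, _), nxt in zip(popped, results):
                if nxt is None:
                    continue
                cost = g[st] + 1.0                     # unit edge cost (assumption)
                if cost < g.get(nxt, float("inf")):
                    g[nxt], parent[nxt] = cost, st
                    improved[nxt] = True
            if goal in g:
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            if improved:
                push_edges(list(improved))
    return None
```

Calling `search((0, 0), (5, 5), {(2, 0), (2, 1), (2, 2), (2, 3)})` returns a list of grid cells from start to goal that routes around the wall. With `w > 1` the search leans more heavily on the critic, so the returned path is not guaranteed to be the shortest one. In this toy the thread pool gives no real speedup because `evaluate_edge` is trivial; it only shows where the second level of parallelism would sit.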
