Parallel Heuristic Search as Inference for Actor-Critic Reinforcement Learning Models

Hanlan Yang, Itamar Mishani, Luca Pivetti, Zachary Kingston, Maxim Likhachev

TL;DR

This work tackles the deployment bottleneck of RL in robotics by enabling multi-step, forward-inference reasoning through a parallel best-first search that leverages both components of an actor-critic model. It introduces PACHS, which uses the actor $\pi_{\theta}$ to generate candidate actions and the critic $Q_{\phi}(s,a)$ as a learned cost-to-go heuristic within a parallel edge-expansion framework, achieving significant computational efficiency via multi-level parallelization. Empirical results on Panda-Shelf motion planning and PushT tasks show high success rates and robust generalization to obstacle-rich settings without retraining, while maintaining competitive planning times and reduced edge evaluations thanks to the critic-guided priorities, e.g., $f(e)=g(s)+w\cdot Q_{\phi}(s,a)$. This demonstrates practical zero-shot planning with learned inference and points toward real-robot deployment and theoretical guarantees as promising future directions.

Abstract

Actor-Critic models are a class of model-free deep reinforcement learning (RL) algorithms that have demonstrated effectiveness across various robot learning tasks. While considerable research has focused on improving training stability and data sampling efficiency, most deployment strategies have remained relatively simplistic, typically relying on direct actor policy rollouts. In contrast, we propose PACHS (Parallel Actor-Critic Heuristic Search), an efficient parallel best-first search algorithm for inference that leverages both components of the actor-critic architecture: the actor network generates actions, while the critic network provides cost-to-go estimates to guide the search. Two levels of parallelism are employed within the search: actions and cost-to-go estimates are generated in batches by the actor and critic networks respectively, and graph expansion is distributed across multiple threads. We demonstrate the effectiveness of our approach in robotic manipulation tasks, including collision-free motion planning and contact-rich interactions such as non-prehensile pushing. Visit p-achs.github.io for demonstrations and examples.
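The search loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the grid world, the unit edge cost, and the names `actor_batch`, `critic_batch`, `evaluate_edge`, and `search` are assumptions made for illustration only. The actor and critic are plain functions standing in for neural networks, and the critic is treated as a cost-to-go (lower is better) so that edges can be ordered by $f(e)=g(s)+w\cdot Q(s,a)$.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy world: a 7x7 grid with a few blocked cells.
GOAL = (5, 5)
OBSTACLES = {(2, 2), (2, 3), (3, 2)}
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def actor_batch(states):
    """Stand-in for the actor: propose candidate actions for a batch of states."""
    return [list(MOVES) for _ in states]

def critic_batch(pairs):
    """Stand-in for the critic: cost-to-go of (s, a) as Manhattan distance after a."""
    return [abs(s[0] + a[0] - GOAL[0]) + abs(s[1] + a[1] - GOAL[1]) for s, a in pairs]

def evaluate_edge(edge):
    """Stand-in for expensive edge evaluation (e.g. collision checking)."""
    s, a = edge
    nxt = (s[0] + a[0], s[1] + a[1])
    valid = 0 <= nxt[0] <= 6 and 0 <= nxt[1] <= 6 and nxt not in OBSTACLES
    return nxt if valid else None

def reconstruct(parent, s):
    """Follow parent pointers back to the start and return the path in order."""
    path = []
    while s is not None:
        path.append(s)
        s = parent[s]
    return path[::-1]

def search(start, w=2.0, n_threads=4, max_evals=10_000):
    g = {start: 0.0}          # best known cost-to-come per state
    parent = {start: None}
    open_list = []            # heap of (f, tie_breaker, state, action)
    tie = 0

    def push_edges(states):
        # Batch level: one actor call and one critic call for all new states.
        nonlocal tie
        actions = actor_batch(states)
        pairs = [(s, a) for s, acts in zip(states, actions) for a in acts]
        for (s, a), q in zip(pairs, critic_batch(pairs)):
            heapq.heappush(open_list, (g[s] + w * q, tie, s, a))
            tie += 1

    push_edges([start])
    evals = 0
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        while open_list and evals < max_evals:
            # Thread level: evaluate the best few edges concurrently.
            k = min(n_threads, len(open_list))
            edges = [heapq.heappop(open_list)[2:] for _ in range(k)]
            results = list(pool.map(evaluate_edge, edges))
            evals += len(edges)
            new_states = []
            for (s, a), nxt in zip(edges, results):
                # Skip invalid edges and edges that do not improve g (unit edge cost).
                if nxt is None or g[s] + 1.0 >= g.get(nxt, float("inf")):
                    continue
                g[nxt] = g[s] + 1.0
                parent[nxt] = s
                if nxt == GOAL:
                    return reconstruct(parent, nxt), evals
                if nxt not in new_states:
                    new_states.append(nxt)
            if new_states:
                push_edges(new_states)
    return None, evals

path, evals = search((0, 0))
```

The sketch returns as soon as the goal is first reached, so the path is valid but not necessarily the cheapest. In CPython, threads do not speed up pure-Python CPU-bound work such as this toy `evaluate_edge`; the thread pool only shows where concurrent edge evaluation would sit in the loop.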

Paper Structure

This paper contains 17 sections, 6 equations, 7 figures, 2 algorithms.

Figures (7)

  • Figure 1: PACHS leverages both actor and critic networks in a parallel best-first search: the actor generates candidate actions while the critic provides learned heuristics to guide exploration through the state space. Shown here for the push-T task, the algorithm builds an implicit lattice graph to find a trajectory along which the robotic arm pushes a T-shaped object to a target pose.
  • Figure 2: PACHS multi-level parallelization. Yellow elements represent completed expansions and evaluations. The $\textit{OPEN}$ list maintains edge candidates ordered by a priority function. Red edges ($e_5$, $e_6$) undergo parallel evaluation across threads, while blue states ($s_5$, $s_6$) have their actions and heuristics generated in batches. Both processes execute simultaneously, demonstrating CPU thread-level and GPU batch-level parallelization.
  • Figure 3: Our four simulated environments. Upper left, $\textit{Panda-Shelf}$: find a collision-free motion plan from the start state to a random end-effector (EE) target position within the shelves (white marker). Upper right, $\textit{PushT-Fixed}$: push the T-shaped object from a random start state to a fixed target pose (gray T shape). Bottom left, $\textit{PushT-Rand}$: push the T from a random start pose to a random goal pose. Bottom right, $\textit{PushT-Obs}$: push the T from a random start pose to a fixed target pose with added obstacles; the T-shaped object must be pushed between the blocks to reach the goal.
  • Figure 4: Results for collision-free motion planning in the Panda shelf environment. Upper right: Success rate and planning time for each algorithm. Only ePA*SE and PACHS consistently find successful motions. Unlike ePA*SE, however, PACHS uses two neural networks during planning, one for action generation and one for cost-to-go evaluation, both of which are computationally expensive. Despite this overhead, its planning time remains comparable. Lower right: Confusion matrix (row to column ratio) showing that PACHS produces solutions with costs similar to the baselines. Left: PACHS uses learned modules that focus the search, resulting in fewer evaluated nodes per number of expanded nodes, supporting the use of the critic network to estimate edge costs and prioritize them for evaluation.
  • Figure 5: Algorithm comparison across three push-T task domains: $\textit{PushT-Fixed}$, $\textit{PushT-Rand}$, $\textit{PushT-Obs}$. We test each planner's ability to find a plan within a budget of 100,000 edge evaluations, without considering execution. For each scenario, we generate 30 different instances and run each 5 times to obtain average and confidence interval statistics, totaling 150 runs per scenario and planner. As shown in the left plot, policy rollout achieves a 93% success rate in the fixed goal case, while PACHS achieves 100%. In the random goal case, policy rollout succeeds only 20% of the time, while PACHS again achieves 100%. When obstacles are added to the fixed goal environment, PACHS generalizes significantly better than both single rollout and parallel rollout methods. The planning time and cost results shown are based only on successful runs; although PACHS has slower planning times, it achieves substantially lower costs and solves more instances overall.
  • ...and 2 more figures