A Parallel CPU-GPU Framework for Batching Heuristic Operations in Depth-First Heuristic Search

Ehsan Futuhi; Nathan R. Sturtevant

TL;DR

This paper tackles the challenge of batching heuristic evaluations in depth-first heuristic search. It designs a parallel cost-bounded depth-first search (CB-DFS) framework in which CPU threads explore separate subtrees while their heuristic lookups are collected into batches and evaluated by a neural network on the GPU. The authors implement Single-GPU Batch CB-DFS, extend it to Multi-GPU setups, prove correctness, and report strong speedups on the Rubik's Cube and the 15-puzzle with both classifier- and regression-based heuristics. Key contributions include Batch IDA* and Batch BTS with formal correctness guarantees, memory-cost analyses, and extensive empirical results showing speedups of up to tens of times over non-batched baselines and performance competitive with AIDA*. The work offers a practical path to using larger neural heuristics in depth-first search, enabling more scalable, higher-quality heuristic guidance in complex domains.

Abstract

The rapid advancement of GPU technology has unlocked powerful parallel processing capabilities, creating new opportunities to enhance classic search algorithms. This hardware has been exploited in best-first search algorithms with neural network-based heuristics by creating batched versions of A* and Weighted A* that delay heuristic evaluation until sufficiently many states can be evaluated in parallel on the GPU. However, prior research has not addressed how depth-first algorithms such as IDA* or Budgeted Tree Search (BTS) can have their heuristic computations batched. Batching is more complicated in a tree search because progress in the search tree is blocked until heuristic evaluations are complete. In this paper we show that GPU parallelization of heuristics can be performed effectively when the tree search is parallelized on the CPU while heuristic evaluations are parallelized on the GPU. We develop a parallelized cost-bounded depth-first search (CB-DFS) framework that can be applied to both IDA* and BTS, significantly improving their performance. We demonstrate the strength of the approach on the 3x3 Rubik's Cube and the 4x4 sliding tile puzzle (STP) with both classifier-based and regression-based heuristics.
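The division of labor described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the grid domain, the names `Batcher`, `batch_heuristic`, `cb_dfs`, and `batch_ida_star`, the batch size, and the wait time are all assumptions made for illustration, and a plain Manhattan-distance function stands in for the batched neural-network call on the GPU. Each CPU worker searches one subtree and blocks only on its own request, while a separate thread merges requests from all workers into one batched call.

```python
import queue
import threading
from concurrent.futures import Future, ThreadPoolExecutor

GOAL = (3, 2)  # hypothetical toy domain: reach GOAL on an unbounded grid

def batch_heuristic(states):
    """Stand-in for one batched GPU call (here: Manhattan distance)."""
    return [abs(x - GOAL[0]) + abs(y - GOAL[1]) for x, y in states]

def successors(state):
    x, y = state
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

class Batcher:
    """Collects heuristic requests from many workers into shared batches."""

    def __init__(self, max_batch=64, wait=0.001):
        self.requests = queue.Queue()
        self.max_batch, self.wait = max_batch, wait
        # Daemon thread: it is never shut down explicitly in this sketch.
        threading.Thread(target=self._loop, daemon=True).start()

    def evaluate(self, states):
        fut = Future()
        self.requests.put((states, fut))
        return fut.result()  # blocks only the calling worker

    def _loop(self):
        while True:
            pending = [self.requests.get()]
            count = len(pending[0][0])
            # Keep gathering until the batch is full or no request arrives in time.
            while count < self.max_batch:
                try:
                    item = self.requests.get(timeout=self.wait)
                except queue.Empty:
                    break
                pending.append(item)
                count += len(item[0])
            flat = [s for states, _ in pending for s in states]
            values = batch_heuristic(flat)  # one call for all waiting workers
            start = 0
            for states, fut in pending:
                fut.set_result(values[start:start + len(states)])
                start += len(states)

def cb_dfs(state, g, bound, parent, batcher):
    """Cost-bounded DFS with unit edge costs; True if GOAL is reachable with f <= bound."""
    if state == GOAL:
        return True
    children = [c for c in successors(state) if c != parent]
    h_values = batcher.evaluate(children)  # progress waits on this batch
    for child, h in zip(children, h_values):
        if g + 1 + h <= bound and cb_dfs(child, g + 1, bound, state, batcher):
            return True
    return False

def batch_ida_star(start, workers=4, max_bound=20):
    """Iterative deepening; each root subtree is searched by a CPU worker."""
    if start == GOAL:
        return 0
    batcher = Batcher()
    bound = batcher.evaluate([start])[0]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while bound <= max_bound:
            roots = successors(start)
            h_values = batcher.evaluate(roots)
            jobs = [pool.submit(cb_dfs, c, 1, bound, start, batcher)
                    for c, h in zip(roots, h_values) if 1 + h <= bound]
            if any(job.result() for job in jobs):
                return bound
            bound += 1  # simplification: raise the bound by one unit per iteration
    return None
```

On this toy grid, `batch_ida_star((0, 0))` returns 5, the Manhattan distance to the goal. The sketch splits work only at the root and raises the bound by one each iteration, which is a simplification; the subtree distribution, bound updates, and GPU scheduling in the paper may differ.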
