Test-Time Graph Search for Goal-Conditioned Reinforcement Learning

Evgenii Opryshko; Junwei Quan; Claas Voelcker; Yilun Du; Igor Gilitschenski

TL;DR

Test-Time Graph Search (TTGS) augments offline goal-conditioned reinforcement learning with inference-time planning over a graph built from the pre-collected dataset. It converts a distance signal, typically derived from a goal-conditioned value function, into edge costs and uses shortest-path search to generate subgoals that guide a frozen policy, without any retraining or online interaction. The method supports value-derived as well as domain-specific distances, and introduces a soft-horizon penalty to discourage unreliable long jumps. On the OGBench benchmark, TTGS yields substantial improvements on long-horizon stitching tasks for multiple base learners, indicating that simple, metric-guided planning can unlock latent long-horizon competence in value-based GCRL agents. Because it runs entirely at inference and leaves training unchanged, the approach can be reused with existing offline RL pipelines.
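To illustrate how a soft-horizon penalty can discourage long jumps, here is a minimal sketch. The summary above does not give the exact form of the penalty, so the linear form below, the function name `soft_horizon_cost`, and the parameters `horizon` and `penalty` are all assumptions made for illustration, not the authors' definition.

```python
def soft_horizon_cost(distance, horizon, penalty):
    """Turn a predicted distance into an edge cost.

    Hypothetical form (an assumption, not taken from the paper):
    distances up to `horizon` are used as-is, and any excess beyond
    `horizon` is charged an extra `penalty` per unit of distance.
    """
    excess = max(0.0, distance - horizon)
    return distance + penalty * excess


# One long jump of length 7 versus two short hops of length 3.5 each,
# with a horizon of 5 and a penalty weight of 10.
long_jump = soft_horizon_cost(7.0, horizon=5.0, penalty=10.0)
two_hops = 2 * soft_horizon_cost(3.5, horizon=5.0, penalty=10.0)
print(long_jump, two_hops)
```

Under this assumed form, the single long jump costs more than the two short hops, so a shortest-path search prefers routes made of edges whose distance estimates stay within the horizon.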

Abstract

Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightweight planning approach to solve the GCRL task. TTGS accepts any state-space distance or cost signal, builds a weighted graph over dataset states, and performs fast search to assemble a sequence of subgoals that a frozen policy executes. When the base learner is value-based, the distance is derived directly from the learned goal-conditioned value function, so no handcrafted metric is needed. TTGS requires no changes to training, no additional supervision, no online interaction, and no privileged information, and it runs entirely at inference. On the OGBench benchmark, TTGS improves success rates of multiple base learners on challenging locomotion tasks, demonstrating the benefit of simple metric-guided test-time planning for offline GCRL.
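The pipeline described in the abstract (a distance signal, a weighted graph over dataset states, and a search that yields subgoals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `plan_subgoals`, its parameters, the fully connected graph, the linear penalty beyond `horizon`, and the Euclidean stand-in for the distance are all hypothetical. In practice the `distance` callable would be derived from the learned goal-conditioned value function or from a domain-specific metric.

```python
import heapq
import math


def plan_subgoals(states, distance, start, goal, horizon, penalty):
    """Dijkstra search over dataset states.

    Returns the list of state indices from `start` to `goal`.
    Edge costs use a hypothetical soft-horizon penalty: distance
    beyond `horizon` is charged `penalty` extra per unit.
    """
    def edge_cost(i, j):
        d = distance(states[i], states[j])
        return d + penalty * max(0.0, d - horizon)

    best = {start: 0.0}      # cheapest known cost to each node
    parent = {}              # back-pointers for path recovery
    frontier = [(0.0, start)]
    done = set()
    while frontier:
        cost, i = heapq.heappop(frontier)
        if i in done:
            continue
        done.add(i)
        if i == goal:
            break
        # Fully connected graph: every other state is a candidate subgoal.
        for j in range(len(states)):
            if j in done:
                continue
            new_cost = cost + edge_cost(i, j)
            if new_cost < best.get(j, math.inf):
                best[j] = new_cost
                parent[j] = i
                heapq.heappush(frontier, (new_cost, j))

    # Walk back-pointers from the goal to recover the subgoal sequence.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# Toy dataset: five states on a line; Euclidean distance stands in
# for a value-derived distance.
states = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
subgoals = plan_subgoals(states, math.dist, start=0, goal=4,
                         horizon=1.5, penalty=10.0)
print(subgoals)  # [0, 1, 2, 3, 4]
```

In this toy example the penalty makes every hop longer than the horizon more expensive than a chain of unit steps, so the search returns every intermediate state as a subgoal. A frozen goal-conditioned policy would then be conditioned on each subgoal in turn. For large datasets, a sparser graph (for example, a subsample of states or nearest-neighbour edges) would likely be needed; that choice is left open here.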
