Test-Time Graph Search for Goal-Conditioned Reinforcement Learning

Evgenii Opryshko, Junwei Quan, Claas Voelcker, Yilun Du, Igor Gilitschenski

TL;DR

Test-Time Graph Search (TTGS) augments offline goal-conditioned reinforcement learning by performing inference-time planning over a graph built from the pre-collected dataset. It converts a distance signal, typically derived from a goal-conditioned value function, into edge costs and uses shortest-path search to generate subgoals that guide a frozen policy, without any retraining or online interaction. The method supports value-derived as well as domain-specific distances, and introduces a soft-horizon penalty to discourage unreliable long jumps. Across OGBench benchmarks, TTGS yields substantial improvements on long-horizon stitching tasks for multiple base learners, demonstrating that simple, metric-guided planning can unlock latent long-horizon competence in value-based GCRL agents. The approach is lightweight, flexible, and readily reusable with existing offline RL pipelines, underscoring the practical impact of test-time planning for offline datasets.

Abstract

Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightweight planning approach to solve the GCRL task. TTGS accepts any state-space distance or cost signal, builds a weighted graph over dataset states, and performs fast search to assemble a sequence of subgoals that a frozen policy executes. When the base learner is value-based, the distance is derived directly from the learned goal-conditioned value function, so no handcrafted metric is needed. TTGS requires no changes to training, no additional supervision, no online interaction, and no privileged information, and it runs entirely at inference. On the OGBench benchmark, TTGS improves success rates of multiple base learners on challenging locomotion tasks, demonstrating the benefit of simple metric-guided test-time planning for offline GCRL.
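
The graph-construction and search steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_graph`, `shortest_path`), the Euclidean stand-in distance, and the hard `max_edge` cutoff are all assumptions made for illustration. In TTGS the distance would come from the learned goal-conditioned value function or a domain-specific metric, and long edges are penalized rather than cut.

```python
import heapq
import math


def euclidean(a, b):
    """Stand-in distance (assumption); TTGS would plug in a value-derived
    or domain-specific distance here."""
    return math.dist(a, b)


def build_graph(states, distance, max_edge):
    """Connect every ordered pair of states whose distance is at most
    max_edge. The hard cutoff is a simplification for this sketch."""
    graph = {i: [] for i in range(len(states))}
    for i, s in enumerate(states):
        for j, t in enumerate(states):
            if i != j:
                d = distance(s, t)
                if d <= max_edge:
                    graph[i].append((j, d))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm. Returns the vertex indices from start to goal,
    or None if the goal is unreachable."""
    best = {start: 0.0}
    parent = {}
    visited = set()
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == goal:
            break
        for v, w in graph[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in visited:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# Toy dataset: states along an L-shaped corridor (hypothetical data).
states = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
graph = build_graph(states, euclidean, max_edge=1.0)
path = shortest_path(graph, start=0, goal=4)
# Every vertex after the start becomes a subgoal for the frozen policy.
subgoals = [states[i] for i in path[1:]]
```

In this toy setting the frozen policy would be conditioned on each entry of `subgoals` in turn instead of on the distant final goal. The all-pairs loop is quadratic in the number of vertices, which is acceptable only for a small sampled subset of the dataset.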

Paper Structure

This paper contains 35 sections, 8 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of TTGS. From an offline dataset, we sample observations to form graph vertices. We assign edge weights using a distance signal, either derived from a pretrained goal-conditioned value function or from domain-specific knowledge. A shortest-path search with Dijkstra's algorithm yields a sequence of subgoals that guides a frozen policy at test time.
  • Figure 2: Motivation for TTGS. (a) The HIQL policy fails to reach a distant goal on antmaze-giant-stitch-v0: multiple attempts fail to exit the starting area, and two attempts run out of time because of an inefficient path. (b) TTGS finds a guiding path using dataset observations. At each step it selects a subgoal within a predefined radius of the agent. All data points on the guiding path are marked in gray, and the actual path traversed by the agent is marked in blue. (c) The performance of different agents' policies decreases as the number of steps required to reach the goal increases. By providing the policy with nearby subgoals, TTGS improves the reliability and efficiency of reaching the goal.
  • Figure 3: Goal-reaching success rates for QRL, GCIQL, and HIQL with and without TTGS. Distances are predicted from each base agent’s learned value function. TTGS consistently improves or preserves performance on locomotion tasks that require trajectory stitching.
  • Figure 4: Ablations and hyperparameters. (a) Comparison of full TTGS, using HIQL as the base learner and value-derived distances, with two ablations: Next Subgoal replaces our subgoal selection procedure with always picking the immediate next waypoint, and No-Penalty uses raw predicted distances as edge weights instead of penalizing long connections. TTGS outperforms both ablations across datasets. (b) Effect of the penalty threshold $\tau$ on the guide path and a value-derived distance field. Colors denote predicted distances from each dataset observation to the goal in the top-right corner. Smaller $\tau$ yields denser subgoals and less direct paths. Larger $\tau$ permits longer hops that can require navigating around obstacles, which is harder for the frozen policy.
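
The two mechanisms examined in the Figure 4 ablations, penalizing long edges and choosing a subgoal within a radius of the agent, can be sketched as follows. The exact formulas are not given in this summary, so everything here is an assumption: the piecewise-linear form of `penalized_cost`, its `penalty` factor, and the "furthest waypoint along the path that is still within the radius" rule in `select_subgoal` are hypothetical choices that merely match the qualitative description.

```python
import math


def penalized_cost(d, tau, penalty=10.0):
    """Hypothetical soft-horizon penalty: distances up to the threshold tau
    are kept as-is, and the excess above tau is scaled by `penalty`."""
    if d <= tau:
        return d
    return tau + penalty * (d - tau)


def select_subgoal(agent, guide_path, distance, radius):
    """Assumed selection rule: return the waypoint furthest along the guide
    path that is within `radius` of the agent. If none qualifies, fall back
    to the waypoint nearest to the agent."""
    chosen = None
    for waypoint in guide_path:
        if distance(agent, waypoint) <= radius:
            chosen = waypoint
    if chosen is None:
        chosen = min(guide_path, key=lambda w: distance(agent, w))
    return chosen


# One long hop of length 3 versus three short hops of length 1, with tau = 1.
long_hop = penalized_cost(3.0, tau=1.0)          # 1 + 10 * (3 - 1) = 21.0
short_hops = 3 * penalized_cost(1.0, tau=1.0)    # 3.0

# Hypothetical guide path along an L-shaped corridor.
guide_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
subgoal = select_subgoal((0.2, 0.0), guide_path, math.dist, radius=2.0)
```

Under this assumed penalty, shortest-path search prefers chains of short edges over a single long edge, which is consistent with the caption's observation that a smaller $\tau$ yields denser subgoals. Always returning the immediate next waypoint instead would correspond to the Next Subgoal ablation.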