Prompting Decision Transformers for Zero-Shot Reach-Avoid Policies

Kevin Li, Marinka Zitnik

TL;DR

This work addresses safety-critical reach-avoid decision-making where avoid constraints may change at evaluation time. It introduces RADT, a prompt-based decision transformer that encodes goals and avoid regions as input prompts, enabling reward-free offline learning and zero-shot generalization to any number of avoid regions of arbitrary size. A novel two-pass hindsight avoid-region relabeling strategy allows learning from suboptimal offline trajectories without rewards. Across robotics benchmarks and a cellular reprogramming case study, RADT performs competitively with or better than retrained baselines and demonstrates robust zero-shot avoidance in both continuous and discrete stochastic domains.

Abstract

Offline goal-conditioned reinforcement learning methods have shown promise for reach-avoid tasks, where an agent must reach a target state while avoiding undesirable regions of the state space. Existing approaches typically encode avoid-region information into an augmented state space and cost function, which prevents flexible, dynamic specification of novel avoid-region information at evaluation time. They also rely heavily on well-designed reward and cost functions, limiting scalability to complex or poorly structured environments. We introduce RADT, a decision transformer model for offline, reward-free, goal-conditioned, avoid-region-conditioned RL. RADT encodes goals and avoid regions directly as prompt tokens, allowing any number of avoid regions of arbitrary size to be specified at evaluation time. Using only suboptimal offline trajectories from a random policy, RADT learns reach-avoid behavior through a novel combination of goal and avoid-region hindsight relabeling. We benchmark RADT against 3 existing offline goal-conditioned RL models across 11 tasks, environments, and experimental settings. RADT generalizes in a zero-shot manner to out-of-distribution avoid region sizes and counts, outperforming baselines that require retraining. In one such zero-shot setting, RADT achieves a 35.7% improvement in normalized cost over the best retrained baseline while maintaining high goal-reaching success. We apply RADT to cell reprogramming in biology, where it reduces visits to undesirable intermediate gene expression states during trajectories to desired target states, despite stochastic transitions and discrete, structured state dynamics.
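
The box-based avoid encoding and the goal and avoid-region hindsight relabeling described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`encode_box`, `violates`, `relabel`), the fixed `half_width`, the rejection sampling used to find an avoided box, and the convention that z = 1 means a successful avoid are all assumptions made here for illustration.

```python
import numpy as np


def encode_box(low, high):
    """Encode an axis-aligned box as a flat vector of two opposite corners."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    assert np.all(low <= high), "low corner must not exceed high corner"
    return np.concatenate([low, high])


def violates(states, box):
    """Return True if any state of the trajectory lies inside the box."""
    d = states.shape[1]
    low, high = box[:d], box[d:]
    inside = np.all((states >= low) & (states <= high), axis=1)
    return bool(inside.any())


def relabel(states, half_width, rng, max_tries=100):
    """Hypothetical hindsight relabeling of one reward-free trajectory.

    Returns the hindsight goal plus two (box, z) pairs:
    one box the trajectory enters (z = 0, assumed convention)
    and one box it never enters (z = 1, assumed convention).
    """
    # Goal hindsight: treat the final state actually reached as the goal.
    goal = states[-1]

    # Violating version: centre a box on a visited, non-final state.
    idx = rng.integers(max(len(states) - 1, 1))
    centre = states[idx]
    violated_box = encode_box(centre - half_width, centre + half_width)

    # Avoiding version: rejection-sample a nearby box the trajectory misses.
    lo = states.min(axis=0) - 2 * half_width
    hi = states.max(axis=0) + 2 * half_width
    for _ in range(max_tries):
        centre = rng.uniform(lo, hi)
        avoided_box = encode_box(centre - half_width, centre + half_width)
        if not violates(states, avoided_box):
            return goal, (violated_box, 0), (avoided_box, 1)
    raise RuntimeError("could not sample a box that the trajectory avoids")
```

Because each avoid region is its own fixed-length corner vector, a prompt built this way can hold any number of boxes without changing the dimensionality of the state, which is the property the abstract relies on for zero-shot changes in avoid-region count.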

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, 4 tables, and 2 algorithms.

Figures (9)

  • Figure 1: (a) An ideal reach-avoid model should learn to avoid arbitrarily specified regions of varying number and size at evaluation time, using only suboptimal, random-policy training data. (b) RADT is the only model that satisfies all criteria for an ideal reach-avoid learner (Section \ref{problemformulation}).
  • Figure 2: (a) RADT receives goal states and avoid regions as prompt inputs. (b) Avoid regions are defined as boxes in the state space and encoded as vectors of bounding box corner coordinates. (c) For each offline trajectory, we generate two versions: one that violates a sampled avoid region and one that avoids it. Both are labeled with an avoid success token $z$. (d) Prior models encode avoid regions via augmented state vectors, which grow with the number of avoid regions, preventing zero-shot generalization to unseen avoid counts.
  • Figure 3: (a) Visualization of the FetchReachObstacle environment. The red point is the goal; the blue box is the avoid region. (b) Unlike prior setups, the robot arm can pass through avoid boxes, allowing training data to include violations. (c) RADT and AM-Lag achieve state-of-the-art reach-avoid performance on in-distribution box sizes, measured by MNC and SR. (d) RADT generalizes zero-shot to out-of-distribution avoid box sizes, matching or surpassing the best baseline (AM-Lag), which needs to be retrained on every new avoid box size.
  • Figure 4: (a) Visualization of the MazeObstacle environment, with red goal, blue avoid regions, and green agent. (b) RADT outperforms all baselines on MNC and SR in the in-distribution single-avoid setting. (c) RADT generalizes zero-shot to out-of-distribution numbers of avoid regions, matching the best retrained baseline (AM-Lag) in MNC and surpassing it in SR. Note that AM-Lag is retrained for every new number of avoid regions (i.e., it is not zero-shot). Error bars show ±1 standard deviation.
  • Figure 5: (a) Cell reprogramming involves sequential gene perturbations to reach a target expression state while avoiding unsafe intermediate states. (b) Evaluation pipeline: RADT is first run without an avoid token. The most frequently visited intermediate state (e.g., gray cell state) is then added as an avoid token, and RADT is re-evaluated. Ideally, the new trajectories will go through the gray cell state less often. (c) RADT reduces visitation frequency to specified avoid states and, when avoidance is infeasible, minimizes time spent in those states. Error bars show ±1 standard deviation. Some illustrations adapted from NIAID NIH BIOART (Appendix \ref{sec:references-appendix}).
  • ...and 4 more figures
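
The evaluation pipeline in the Figure 5(b) caption (run without an avoid token, pick the most frequently visited intermediate state, then re-run with that state as the avoid token) can be sketched as below. This is a hedged illustration only: `rollout` is a hypothetical callable standing in for a trained policy, and counting a state once per trajectory (rather than once per timestep) is an assumption, since the caption does not say how visits are counted.

```python
from collections import Counter


def most_visited_intermediate(trajectories):
    """Most common state, ignoring each trajectory's first and last state."""
    counts = Counter()
    for traj in trajectories:
        # Assumption: count each state at most once per trajectory.
        counts.update(set(traj[1:-1]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def visit_fraction(trajectories, state):
    """Fraction of trajectories that pass through `state` as an intermediate."""
    hits = sum(state in traj[1:-1] for traj in trajectories)
    return hits / len(trajectories)


def evaluate_with_avoid(rollout, n_episodes):
    """Two-stage evaluation: unconstrained runs, then avoid-conditioned runs.

    `rollout(avoid=...)` is a hypothetical function returning one trajectory
    as a list of hashable discrete state labels.
    """
    baseline = [rollout(avoid=None) for _ in range(n_episodes)]
    avoid_state = most_visited_intermediate(baseline)
    conditioned = [rollout(avoid=avoid_state) for _ in range(n_episodes)]
    before = visit_fraction(baseline, avoid_state)
    after = visit_fraction(conditioned, avoid_state)
    return avoid_state, before, after
```

A drop from `before` to `after` corresponds to the behavior the caption calls ideal, where the re-evaluated trajectories pass through the selected state less often.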