Cost-Aware Diffusion Active Search

Arundhati Banerjee, Jeff Schneider

TL;DR

This work identifies the optimism bias in prior diffusion-based reinforcement learning approaches when applied to the active search setting and proposes mitigating solutions for efficient cost-aware decision making with both single-agent and multi-agent teams.

Abstract

Active search for recovering objects of interest (OOIs) through online, adaptive decision making with autonomous agents requires trading off exploration of unknown environments against exploitation of prior observations in the search space. Prior work has proposed myopic, greedy approaches based on information gain and Thompson sampling for agents to actively decide query or search locations when the number of targets is unknown. Work on decision making in such partially observable environments has also shown that agents capable of lookahead over a finite horizon outperform myopic policies for active search. Unfortunately, lookahead algorithms typically rely on building a computationally expensive search tree that is simulated and updated based on the agent's observations and a model of the environment dynamics. Instead, in this work, we leverage the sequence modeling abilities of diffusion models to sample lookahead action sequences that balance the exploration-exploitation trade-off for active search without building an exhaustive search tree. We identify the optimism bias in prior diffusion-based reinforcement learning approaches when applied to the active search setting and propose mitigating solutions for efficient cost-aware decision making with both single-agent and multi-agent teams. Our proposed algorithm outperforms standard offline reinforcement learning baselines in terms of full recovery rate and is computationally more efficient than tree search in cost-aware active decision making.

Paper Structure

This paper contains 15 sections, 7 equations, 8 figures, 1 table, 2 algorithms.

Figures (8)

  • Figure 1: (a) Problem setup. Agents sense different parts of the environment looking for OOIs. True OOIs are crossed in black. Targets detected by the agent in its field of view are crossed in red. (b) Modeling for cost-aware diffusion in active search. Agents sample lookahead action sequences of length $H$ conditioned on the current belief state and with gradient guidance from the estimated cumulative discounted return over the entire sequence.
  • Figure 2: Optimism bias in samples generated by Diffuser (Janner et al., 2022) for active search. The top row in each block (for $T=0\dots 5$) is the posterior mean estimate (darker shade implies higher value), and the bottom row indicates the true target location and the FOV of the sensing action. Diffusion modeling over the joint state-action space with gradient-guided sampling of state-action sequences learns to generate deterministic target detection after the first generated action. In other words, generated samples are biased to assume the target will be detected in the first timestep. Such sampled action sequences do not balance exploration and exploitation, which explains why Diffuser is unable to generate optimal coverage action sequences for active search.
  • Figure 3: Lookahead vs. myopic decision making in 1D search space of size $n = 16$. $J=1$ agent. $k=1$ target. Low ($\sigma=\frac{1}{16}$) and high ($\sigma=0.2$) observation noise. Our diffusion-based approach (D-AS) recovers the optimal sequence of actions and achieves full recovery faster than the myopic and shallow search-tree-based baselines. Plots show mean and standard error over 20 trials.
  • Figure 4: Cost-aware decision making in 1D search space of size $n=16$. $J=1$ agent. $k=1$ target. Low ($\sigma=\frac{1}{16}$) and high ($\sigma=0.2$) observation noise. Our diffusion-based approach (CD-AS) incurs smaller or competitive total cost compared to CAST when sensing cost is higher than traveling cost ($c_s=50$s). When sensing cost is low, however, CAST selects sensing actions that incur a smaller total cost. Plots show mean and standard error over 20 trials.
  • Figure 5: Lookahead vs. myopic decision making in 2D search space of size $8\times8$. $J=1$ agent. $k=1$ target. Low ($\sigma=\frac{1}{16}$) and high ($\sigma=0.2$) observation noise. Our diffusion-based approach (D-AS) samples the optimal sequence of actions and achieves full recovery with fewer measurements than the myopic active search (EIG), offline RL (IQL), behavior-cloning-based diffusion (diffusion policy), and shallow online tree search (CAST) baselines. Plots show mean and standard error over 10 trials.
  • ...and 3 more figures
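
The caption of Figure 1(b) refers to the estimated cumulative discounted return over a lookahead sequence of length $H$. As a minimal sketch of that quantity only, the standard definition $\sum_{t=0}^{H-1} \gamma^t r_t$ can be computed as below. The function name `discounted_return`, the discount factor `gamma`, and the example reward values are assumptions made for illustration; the paper's actual reward definition and return estimator are not given in this excerpt.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_{t=0}^{H-1} gamma**t * rewards[t] for a length-H sequence."""
    total = 0.0
    # Accumulate from the last step backwards (Horner's scheme), so each
    # earlier step discounts everything that follows it by one factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical per-step rewards for a lookahead of H = 4 sensing actions,
# where the target is detected on the third action (illustrative values only).
example_rewards = [0.0, 0.0, 1.0, 0.0]
print(discounted_return(example_rewards, gamma=0.9))
```

With a discount factor below 1, a detection that arrives later in the sequence contributes less to the return, so sequences that find targets earlier score higher.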