Learning Online Belief Prediction for Efficient POMDP Planning in Autonomous Driving

Zhiyu Huang, Chen Tang, Chen Lv, Masayoshi Tomizuka, Wei Zhan

TL;DR

This work addresses autonomous driving under uncertainty by casting decision-making as a POMDP with a belief state $\mathbf{b}_t$ over other agents, a planning horizon $T$, and a discount factor $\gamma$. An online memory-based belief updater, built from a Transformer encoder and a GRU decoder, yields multi-modal future trajectories ($M$ modalities over a prediction horizon $T_f$) that reflect closed-loop interactions. Planning uses a macro-action MCTS, guided by a DQN prior, to search efficiently within a receding horizon. Experiments on real-world driving data replayed in simulation show improved temporal consistency and decision quality, with the online belief updates and the DQN guidance each contributing notable performance gains.
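
As a rough formalization of the quantities named above, under standard POMDP conventions and not necessarily in the paper's exact notation, the planner seeks a policy that maximizes the expected discounted return while the belief is updated recursively. The reward $r$, policy $\pi$, observation $o_t$, action $a_t$, and learned update function $f_\theta$ are assumed symbols introduced here for illustration:

$$
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t}\, r(\mathbf{b}_t, a_t) \right],
\qquad
\mathbf{b}_{t} = f_\theta\left(\mathbf{b}_{t-1},\, o_{t},\, a_{t-1}\right).
$$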

Abstract

Effective decision-making in autonomous driving relies on accurate inference of other traffic agents' future behaviors. To achieve this, we propose an online belief-update-based behavior prediction model and an efficient planner for Partially Observable Markov Decision Processes (POMDPs). We develop a Transformer-based prediction model, enhanced with a recurrent neural memory model, to dynamically update the latent belief state and infer the intentions of other agents. The model can also integrate the ego vehicle's intentions to reflect closed-loop interactions among agents, and it learns from both offline data and online interactions. For planning, we employ a Monte-Carlo Tree Search (MCTS) planner with macro actions, which reduces computational complexity by searching over temporally extended action steps. Inside the MCTS planner, we use predicted long-term multi-modal trajectories to approximate future updates, which eliminates iterative belief updating and improves runtime efficiency. Our approach also incorporates deep Q-learning (DQN) as a search prior, which significantly improves the performance of the MCTS planner. Experimental results from simulated environments validate the effectiveness of our proposed method. The online belief update model significantly enhances the accuracy and temporal consistency of predictions, leading to improved decision-making performance. Employing DQN as a search prior in the MCTS planner considerably boosts its performance and outperforms an imitation learning-based prior. Additionally, we show that MCTS planning with macro actions substantially outperforms the vanilla method in terms of performance and efficiency.
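
To make the two planning ideas in the abstract concrete, the sketch below shows (1) how macro actions shrink a full search tree by reducing its depth, and (2) one plausible way a learned Q-value estimate could act as a prior when selecting an action at the root. The PUCT-style scoring rule, the softmax over Q-values, and all function names are assumptions made for illustration; this page does not state the paper's exact selection rule.

```python
import math


def num_leaves(num_actions: int, horizon_steps: int, macro_len: int = 1) -> int:
    """Leaves of a full search tree when each decision is held for macro_len steps."""
    depth = math.ceil(horizon_steps / macro_len)
    return num_actions ** depth


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_root_action(q_prior, visit_counts, value_sums, c_puct=1.0):
    """Pick the root action with the highest PUCT-style score (assumed rule).

    q_prior      : Q-value estimates from a learned network (hypothetical input)
    visit_counts : number of simulations through each root action
    value_sums   : accumulated simulation returns per root action
    """
    priors = softmax(q_prior)
    total_visits = sum(visit_counts)
    best_action, best_score = 0, -math.inf
    for a, prior in enumerate(priors):
        n = visit_counts[a]
        mean_value = value_sums[a] / n if n > 0 else 0.0
        # Exploration bonus is scaled by the prior and decays with visits.
        bonus = c_puct * prior * math.sqrt(total_visits + 1) / (1 + n)
        score = mean_value + bonus
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```

With three candidate actions and a 10-step horizon, `num_leaves(3, 10)` counts the leaves of the primitive-action tree, while `num_leaves(3, 10, macro_len=5)` counts those of a tree whose decisions each last five steps. Before any simulations have run, `select_root_action` simply follows the prior; as visits accumulate, the simulated returns take over.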

Paper Structure (13 sections, 12 equations, 5 figures, 3 tables, 2 algorithms)

Figures (5)

  • Figure 1: Illustration of our proposed planning approach. We utilize a neural memory-based belief update model to continually update other human agents' intentions over time based on new observations and the AV's actions. A macro-action-based MCTS planner, guided by a learned Q-value function, searches for the approximately optimal action based on the current belief state.
  • Figure 2: Illustration of the proposed POMDP decision-making framework. (a) Diagram of the POMDP planning process. At each time step, we use a Transformer encoder to map the observation into latent space, which is used to update the latent belief state. We use an MLP decoder to project the latent belief state into probabilistic future trajectories per agent. (b) Macro-action-based MCTS planner. The planner takes as input the estimated future states of other agents and searches over macro-actions. Additionally, we employ a learned Q-value network to guide the selection step at the root node. (c) Structure of the belief update network. The information of the previous latent belief state, current latent observation, and the ego agent's action are fused in the update process.
  • Figure 3: Example of a replayed WOMD scenario in the MetaDrive simulator and the processed observation space, including vectorized map polylines and historical trajectories of surrounding agents.
  • Figure 4: Training results of our proposed method against baseline methods. Left: average episodic reward; right: average success rate.
  • Figure 5: Qualitative comparisons of different behavior prediction models and their influences on the decision-making task. For clarity, only the most likely modality, along with its predicted probability, is displayed, and the length of trajectories is 5 seconds. The offline prediction model shows significant fluctuations in trajectories and probabilities, which cause the AV to fail to move ahead of the interacting agent in Scenario 1. In contrast, our online update model shows consistent predictions over time, with the accuracy gradually increasing when receiving new observations. The enhancements in behavior prediction provided by our online update model enable the MCTS planner to make more human-like decisions.
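
The belief update described in the Figure 2(c) caption fuses the previous latent belief state, the current latent observation, and the ego agent's action. A minimal NumPy sketch of one such recurrent update follows, assuming a GRU-style cell with randomly initialized weights. The class name, dimensions, and gating layout are illustrative assumptions rather than the paper's network, and no Transformer encoder or trajectory decoder is included.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class BeliefUpdater:
    """GRU-style latent belief update (hypothetical stand-in, untrained)."""

    def __init__(self, belief_dim, obs_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = obs_dim + action_dim
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, 0.1, (3, belief_dim, in_dim))
        self.U = rng.normal(0.0, 0.1, (3, belief_dim, belief_dim))
        self.b = np.zeros((3, belief_dim))

    def step(self, belief, latent_obs, ego_action):
        """Return the new belief given the old belief, observation, and ego action."""
        # Fuse the latent observation with the ego action by concatenation.
        x = np.concatenate([latent_obs, ego_action])
        z = sigmoid(self.W[0] @ x + self.U[0] @ belief + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ belief + self.b[1])
        cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * belief) + self.b[2])
        # Convex blend of the old belief and the candidate state.
        return (1.0 - z) * belief + z * cand


if __name__ == "__main__":
    updater = BeliefUpdater(belief_dim=8, obs_dim=16, action_dim=2)
    rng = np.random.default_rng(1)
    belief = np.zeros(8)
    for _ in range(5):
        # Random vectors stand in for encoded observations.
        belief = updater.step(belief, rng.normal(size=16), np.array([1.0, 0.0]))
    print(belief.shape)
```

Because the ego action enters the update alongside the observation, two different ego actions applied to the same belief and observation produce different next beliefs, which is the property the caption attributes to closed-loop prediction.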