Occlusion-Aware Diffusion Model for Pedestrian Intention Prediction

Yu Liu, Zhijie Liu, Zedong Yang, You-Fu Li, He Kong

TL;DR

This work tackles pedestrian crossing intention prediction under occlusion, a critical scenario for autonomous systems. It introduces an Occlusion-Aware Diffusion Model (ODM) that reconstructs occluded motion patterns with an occlusion-masked diffusion transformer and a reverse process guided by occlusion cues. The model learns occlusion reconstruction and intention classification jointly through a multi-task objective, and fuses bounding boxes with ego-vehicle velocity via a gating mechanism. Experiments on PIE and JAAD show greater robustness to occlusion and gains over state-of-the-art methods, underscoring ODM's potential for safer autonomous navigation.

Abstract

Predicting pedestrian crossing intentions is crucial for the navigation of mobile robots and intelligent vehicles. Although recent deep learning-based models have shown significant success in forecasting intentions, few consider incomplete observation under occlusion scenarios. To tackle this challenge, we propose an Occlusion-Aware Diffusion Model (ODM) that reconstructs occluded motion patterns and leverages them to guide future intention prediction. During the denoising stage, we introduce an occlusion-aware diffusion transformer architecture to estimate noise features associated with occluded patterns, thereby enhancing the model's ability to capture contextual relationships in occluded semantic scenarios. Furthermore, an occlusion mask-guided reverse process is introduced to effectively utilize observation information, reducing the accumulation of prediction errors and enhancing the accuracy of reconstructed motion features. The performance of the proposed method under various occlusion scenarios is comprehensively evaluated and compared with existing methods on popular benchmarks, namely PIE and JAAD. Extensive experimental results demonstrate that the proposed method achieves more robust performance than existing methods in the literature.
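The mask-guided reverse process described in the abstract can be illustrated with a small sketch. It assumes a standard DDPM noise schedule and an inpainting-style rule in which, at every reverse step, visible entries are re-sampled from the known observation while occluded entries come from the denoiser. That rule is an assumption made for illustration and is not confirmed as the paper's exact procedure. The names `make_schedule`, `forward_noise`, `masked_reverse`, and `eps_model` are hypothetical, and the noise estimator is a placeholder callable standing in for the occlusion-aware diffusion transformer.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed); returns betas and cumulative alpha products."""
    denom = max(num_steps - 1, 1)
    betas = [beta_start + (beta_end - beta_start) * i / denom for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def forward_noise(x0, alpha_bar, rng):
    """Sample x_k from q(x_k | x_0) = N(sqrt(alpha_bar) * x_0, (1 - alpha_bar) * I)."""
    scale, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [scale * v + noise * rng.gauss(0.0, 1.0) for v in x0]


def masked_reverse(x_obs, visible, eps_model, num_steps=50, seed=0):
    """Reverse diffusion in which visible entries are anchored to the observation.

    x_obs     : observed values (entries at occluded positions are ignored)
    visible   : list of bools, True where the observation is not occluded
    eps_model : callable (x, k) -> estimated noise, a stand-in for a trained network
    """
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in x_obs]  # start from pure noise
    for k in reversed(range(num_steps)):
        beta, a_bar = betas[k], alpha_bars[k]
        eps = eps_model(x, k)
        # Standard DDPM posterior mean computed from the estimated noise.
        mean = [(xi - beta / math.sqrt(1.0 - a_bar) * e) / math.sqrt(1.0 - beta)
                for xi, e in zip(x, eps)]
        sigma = math.sqrt(beta) if k > 0 else 0.0
        x_pred = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        # Known part: the observation noised to the level of the next step.
        x_known = forward_noise(x_obs, alpha_bars[k - 1], rng) if k > 0 else list(x_obs)
        # Mask guidance: keep known values where visible, predictions where occluded.
        x = [kn if vis else pr for kn, pr, vis in zip(x_known, x_pred, visible)]
    return x


# Usage with a dummy estimator that always predicts zero noise.
observed = [0.1, 0.2, 0.0, 0.0, 0.5]
mask = [True, True, False, False, True]
recovered = masked_reverse(observed, mask, lambda x, k: [0.0] * len(x), num_steps=20)
```

Because visible entries are overwritten from the observation at every step, errors made by the denoiser cannot accumulate at those positions; only the occluded entries depend on the estimated noise.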

Paper Structure

This paper contains 36 sections, 34 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: The typical scenario of visual occlusion is illustrated here. Solid green lines represent the parts of the observation that are within the field of view and visible, while dashed red lines indicate positional features that are undetectable due to occlusion. Observations with occlusion pose a significant challenge for pedestrian intention prediction.
  • Figure 2: The overall framework of the ODM. The occluded observations are first embedded into the diffusion block to recover the missing motion features caused by occlusion. These recovered features are then used to estimate the crossing intention through the transformer block.
  • Figure 3: Illustration of the diffusion process for motion indeterminacy variation. In the forward process, noise is gradually added to the raw observation sequences $X_{k}^{raw}$. In the reverse process, the added noise is removed by leveraging the clues provided by the occluded observations to recover observation $X_{k}^{rec}$.
  • Figure 4: The training architecture of the noise estimation process consists of three blocks. In the noise addition block, raw observation sequences are corrupted with noise of a specific density, scaled by the diffusion step $k$, to generate the noised feature $X_{k}^{raw}$. In the observation extraction block, occluded pedestrian observations are integrated with the ego-vehicle's speed through a gating mechanism. These combined features are then added to the features from step $k$ to form the observation vector $X_{k}^{obs}$. In the noise estimation block, an occlusion-masked transformer predicts the noise added at step $k$, enhancing the model’s ability to learn semantic relationships. $\text{SA}$ denotes the sampling operation, $\text{OS}$ the offset operation, $\text{N\&S}$ normalization with scale and shift, $\odot$ element-wise multiplication, and $\oplus$ element-wise addition.
  • Figure 5: Illustration of the fusion gate mechanism for integrating multimodal input features.
  • ...and 11 more figures
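
The gating mechanism mentioned in the captions of Figures 4 and 5 can be sketched as a sigmoid gate that blends two feature vectors element-wise. The exact form of the gate is not given in the captions, so the formula below (gate computed from the concatenated inputs, output as a convex combination) is an assumption, and `fusion_gate`, `weights`, and `bias` are hypothetical names. In practice the weights would be learned; here they are passed in explicitly.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fusion_gate(box_feat, speed_feat, weights, bias):
    """Blend bounding-box and ego-speed features with an element-wise gate.

    box_feat, speed_feat : equal-length feature vectors (lists of floats)
    weights              : one row per output element, each of length 2 * d
    bias                 : list of d floats
    Returns (fused, gate) where fused = gate * box + (1 - gate) * speed.
    """
    if len(box_feat) != len(speed_feat):
        raise ValueError("feature vectors must have the same length")
    concat = list(box_feat) + list(speed_feat)
    gate = [sigmoid(sum(w * c for w, c in zip(row, concat)) + b)
            for row, b in zip(weights, bias)]
    fused = [g * bf + (1.0 - g) * sf for g, bf, sf in zip(gate, box_feat, speed_feat)]
    return fused, gate


# Usage: zero weights and bias give a gate of 0.5, i.e. a plain average.
box = [1.0, 3.0]
speed = [3.0, 1.0]
zero_w = [[0.0] * 4 for _ in range(2)]
fused, gate = fusion_gate(box, speed, zero_w, [0.0, 0.0])
```

A gate close to 1 passes the bounding-box feature through, and a gate close to 0 passes the ego-speed feature, so the model can weight each modality per element.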