Temporal-contextual Event Learning for Pedestrian Crossing Intent Prediction
Hongbin Liang, Hezhe Qiao, Wei Huang, Qizhou Wang, Mingsheng Shang, Lin Chen
TL;DR
TCL addresses pedestrian crossing intention (PCI) prediction for vulnerable road users by combining temporal merging of frames into key events with contextual attention fusion of visual and non-visual data within an enhanced ViT (EnViT) backbone. The Temporal Merging Module (TMM) reduces frame redundancy by clustering the sequence into $N$ key events, while the Contextual Attention Block (CAB) adaptively fuses event-level visual and non-visual cues, yielding a joint representation $O = \tanh(W_p^2 [F_p; F'])$. Non-visual features include pose keypoints and traffic-object cues. Experiments on PIE, JAAD-beh, and JAAD-all demonstrate state-of-the-art performance with high recall, underscoring the practical impact for safer autonomous driving.
Abstract
Accurate prediction of pedestrian crossing intention (PCI) is crucial to the safety of vulnerable road users in autonomous and assisted driving. Most PCI prediction methods forecast crossing intent by analyzing a set of observed ego-view video frames. However, these methods struggle to capture the critical events in pedestrian behaviour along the temporal dimension because the video frames are highly redundant, which leads to sub-optimal PCI prediction performance. We address this challenge with a novel approach called Temporal-contextual Event Learning (TCL). TCL comprises a Temporal Merging Module (TMM), which manages the redundancy by clustering the observed video frames into multiple key temporal events, and a Contextual Attention Block (CAB), which adaptively aggregates the features of these events together with visual and non-visual data. By combining temporal feature extraction with contextual attention over the key information across critical events, TCL learns an expressive representation for PCI prediction. Extensive experiments are carried out on three widely adopted datasets: PIE, JAAD-beh, and JAAD-all. The results show that TCL substantially surpasses state-of-the-art methods. Our code is available at https://github.com/dadaguailhb/TCL.
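
To make the idea of merging redundant frames into key events concrete, the following is a minimal NumPy sketch, not the authors' implementation. The clustering rule is an assumption: adjacent segments with the highest cosine similarity are merged greedily until $N$ events remain. The function names `merge_into_events` and `fuse`, the feature shapes, and the reading of $F_p$ and $F'$ as two plain feature vectors are all hypothetical; only the form $O = \tanh(W [F_p; F'])$ comes from the summary above.

```python
import numpy as np


def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def merge_into_events(frames: np.ndarray, n_events: int) -> np.ndarray:
    """Greedily merge the most similar adjacent segments (assumed rule).

    frames: (T, D) array of per-frame features in temporal order.
    Returns an (n_events, D) array; each row is the mean of the frames
    assigned to that temporally contiguous event.
    """
    if not 1 <= n_events <= len(frames):
        raise ValueError("n_events must be between 1 and the number of frames")
    # Track each segment as a running feature sum plus a frame count.
    sums = [f.astype(float) for f in frames]
    counts = [1] * len(frames)
    while len(sums) > n_events:
        means = [s / c for s, c in zip(sums, counts)]
        sims = [_cosine(means[i], means[i + 1]) for i in range(len(means) - 1)]
        i = int(np.argmax(sims))  # most redundant adjacent pair
        sums[i] = sums[i] + sums[i + 1]
        counts[i] += counts[i + 1]
        del sums[i + 1], counts[i + 1]
    return np.stack([s / c for s, c in zip(sums, counts)])


def fuse(f_p: np.ndarray, f_prime: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Joint representation O = tanh(W [F_p; F']) for two 1-D feature vectors."""
    return np.tanh(w @ np.concatenate([f_p, f_prime]))


# Toy usage: four 2-D frame features that form two obvious events.
frames = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
events = merge_into_events(frames, n_events=2)
rng = np.random.default_rng(0)
joint = fuse(events[0], events[1], rng.normal(size=(3, 4)))
print(events.shape, joint.shape)
```

Restricting merges to adjacent segments keeps every event temporally contiguous, which is one simple way to respect the ordering of the observed frames; a learned or attention-based merging rule could replace the cosine criterion without changing the interface.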
