Another Vertical View: A Hierarchical Network for Heterogeneous Trajectory Prediction via Spectrums
Beihao Xia, Conghao Wong, Duanquan Xu, Qinmu Peng, Xinge You
TL;DR
This work tackles the forecasting of heterogeneous trajectories, which come in diverse representations, by reframing trajectories as time-frequency spectrums. It introduces V$^{2}$-Net and its enhanced version E-V$^{2}$-Net, which use transforms (DFT and Haar) to capture the dynamics of each trajectory dimension and a bilinear structure to model the interactions between dimensions. The approach enables hierarchical prediction across frequency scales and across trajectory forms, with Transformer-based encoders and decoders that integrate the spectrum representations and interactions. Empirically, E-V$^{2}$-Net variants achieve strong or state-of-the-art performance on ETH-UCY, SDD, nuScenes, and Human3.6M across 2D coordinates, bounding boxes, and 3D skeletons, while the analyses highlight transform-specific trade-offs and the value of dimension-wise interactions for high-dimensional trajectories.
Abstract
With the rapid development of AI-related techniques, applications of trajectory prediction are no longer limited to simple scenes and trajectories. A growing number of trajectory forms, such as coordinates, bounding boxes, and even high-dimensional human skeletons, need to be analyzed and forecast. Among these heterogeneous trajectories, interactions between different elements within a single trajectory frame, which we call "dimension-wise interactions", are more complex and challenging. However, most previous approaches focus on one specific form of trajectory and pay little attention to potential dimension-wise interactions. In this work, we expand the trajectory prediction task by introducing the trajectory dimensionality $M$, thus extending its application scenarios to heterogeneous trajectories. We first introduce the Haar transform as an alternative to the Fourier transform to better capture the time-frequency properties of each trajectory dimension. Then, we adopt a bilinear structure to model and fuse two factors simultaneously, namely the time-frequency response and the dimension-wise interaction, so as to forecast heterogeneous trajectories hierarchically via trajectory spectrums in a generic way. Experiments show that the proposed model outperforms most state-of-the-art methods on ETH-UCY, SDD, nuScenes, and Human3.6M with heterogeneous trajectories, including 2D coordinates, 2D/3D bounding boxes, and 3D human skeletons.
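
To make the idea of a per-dimension "trajectory spectrum" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a trajectory is stored as an array of shape $(T, M)$ ($T$ observed steps, $M$ trajectory dimensions, e.g. $M = 4$ for a 2D bounding box) and applies the DFT and a single-level orthonormal Haar transform independently along the time axis of each dimension. The function names, the toy random-walk data, and the choice of a single Haar level are illustrative assumptions.

```python
import numpy as np


def dft_spectrum(traj):
    """DFT along time for each dimension of a (T, M) trajectory.

    Returns amplitude and phase arrays, each of shape (T, M).
    """
    spec = np.fft.fft(traj, axis=0)
    return np.abs(spec), np.angle(spec)


def haar_spectrum(traj):
    """Single-level orthonormal Haar transform along time (assumes even T).

    Returns (approx, detail), each of shape (T // 2, M): the low-frequency
    trend and the high-frequency variation of every dimension.
    """
    if traj.shape[0] % 2 != 0:
        raise ValueError("this sketch requires an even number of time steps")
    even, odd = traj[0::2], traj[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail


def inverse_haar(approx, detail):
    """Invert haar_spectrum and recover the (T, M) trajectory."""
    out = np.empty((2 * approx.shape[0], approx.shape[1]))
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out


# Hypothetical toy data: 8 observed steps of a 2D bounding box (M = 4).
T, M = 8, 4
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(size=(T, M)), axis=0)

amp, phase = dft_spectrum(traj)
approx, detail = haar_spectrum(traj)
recon = inverse_haar(approx, detail)
```

Both transforms are invertible, so a model can in principle predict in the spectrum domain and map the result back to the time domain. The Haar coefficients stay localized in time, whereas each DFT coefficient summarizes the whole sequence, which is one way to read the transform-specific trade-offs mentioned above.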
