Motion Perceiver: Real-Time Occupancy Forecasting for Embedded Systems

Bryce Ferenczi, Michael Burke, Tom Drummond

TL;DR

The paper tackles real-time occupancy forecasting for autonomous systems under a streaming sensor paradigm. It introduces MotionPerceiver, a transformer-based model whose latent state is propagated forward in time by a learned evolution function (self-attention) and corrected by cross-attention whenever new observations arrive. The resulting data-streaming architecture avoids per-agent tracking, achieves competitive AUC and superior Soft IoU on the Waymo Open Motion Dataset, and supports localized occupancy queries suitable for downstream planning. The authors profile inference on an embedded Nvidia Xavier AGX, report robustness to occlusions, and point to ego-action conditioning and multi-trajectory planning as possible extensions for practical deployment.

Abstract

This work introduces a novel and adaptable architecture designed for real-time occupancy forecasting that outperforms existing state-of-the-art models on the Waymo Open Motion Dataset in Soft IoU. The proposed model uses recursive latent state estimation with learned transformer-based functions to effectively update and evolve the state. This enables highly efficient real-time inference on embedded systems, as profiled on an Nvidia Xavier AGX. Our model, MotionPerceiver, achieves this by encoding a scene into a latent state that evolves in time through self-attention mechanisms. Additionally, it incorporates relevant scene observations, such as traffic signals, road topology and agent detections, through cross-attention mechanisms. This forms an efficient data-streaming architecture that contrasts with the expensive, fixed-sequence input common in existing models. The architecture also offers the distinct advantage of generating occupancy predictions through localized querying based on a point of interest, as opposed to generating fixed-size occupancy images that render potentially irrelevant regions.
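The recursive scheme described in the abstract can be pictured as a predict/update loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it uses single-head attention with no learned weights, and the names `evolve`, `update`, `embed_point`, `query_occupancy` and `run_stream` are hypothetical. The sinusoidal point embedding and the sigmoid read-out are likewise placeholders for whatever learned layers the real model uses.

```python
import math


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return out


def residual(state, delta):
    """Add an attention output back onto the latent state."""
    return [[s + d for s, d in zip(rs, rd)] for rs, rd in zip(state, delta)]


def evolve(latent):
    """Time evolution: self-attention, the latent is both query and key-value."""
    return residual(latent, attention(latent, latent, latent))


def update(latent, tokens):
    """Update: cross-attention, the latent queries observation or context tokens."""
    return residual(latent, attention(latent, tokens, tokens))


def embed_point(x, y, channels):
    """Hypothetical sinusoidal embedding of a query position (assumption)."""
    funcs = (math.sin, math.cos)
    return [funcs[i % 2]((x if i % 4 < 2 else y) * 2.0 ** (i // 4))
            for i in range(channels)]


def query_occupancy(latent, points):
    """Occupancy probability at each point of interest (placeholder read-out)."""
    channels = len(latent[0])
    probs = []
    for x, y in points:
        feat = attention([embed_point(x, y, channels)], latent, latent)[0]
        logit = sum(feat) / channels
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


def run_stream(latent, frames, points, context=None):
    """Step the latent through a stream; a frame of None means no observation."""
    history = []
    for observation in frames:
        latent = evolve(latent)                   # always propagate forward
        if context is not None:
            latent = update(latent, context)      # road-graph context, encoded once
        if observation is not None:
            latent = update(latent, observation)  # correct with new detections
        history.append(query_occupancy(latent, points))
    return history
```

The point of the sketch is that the latent state keeps a fixed size no matter how many observation tokens arrive, and that occupancy is only computed at the positions that are actually queried.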

Paper Structure

This paper contains 21 sections, 17 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: MotionPerceiver implements recursive state estimation for motion forecasting. Here, an initial scene observation ($a$) at $t=0s$ is forecast ($b$) $0.4s$ into the future. This learned prediction begins to accumulate error and uncertainty for targets in motion, and shows no occupancy for an agent that was not initially detected. Frame ($c$) shows the predicted occupancy after a scene observation has applied an update to the latent state at $t=0.5s$. Agent occupancy is refined, without explicit data association, and a new agent is added to the state. Images are color coded: green $\rightarrow$ true positive (occupancy prediction $>0.5$), blue $\rightarrow$ false positive, red $\rightarrow$ false negative, black $\rightarrow$ rasterized road graph, red dots $\rightarrow$ traffic signals.
  • Figure 2: An illustration of the use of MotionPerceiver's architecture for real-time occupancy prediction. At $t=0$, the latent state is initialised with the first observation of agents in the scene. The evolution of this latent state is shown using dark blue arrows. Tokenized observations from the scene (light red) are queried by the latent state for information. Rasterized road-graph context (light blue) can be encoded once and provides contextual information at each time-step. The latent state can be queried with a position (light orange) to receive an estimate of occupancy probability at each time-step. When there is no observation information ($t=1$), the latent state is simply propagated forward in time and updated with road-graph context. This is the operation used for forecasting future occupancy or interpolating between observations. At time-steps when scene observations are available ($t=2$), the latent state queries the observation for information, reducing accumulated errors and adding newly observed agents. Dimensions $N_x\times C_x$ describe the number of tokens $N_x$ and channels per token $C_x$. Additional inference diagrams can be found at https://sites.google.com/monash.edu/motionperceiver.
  • Figure 3: Self- and cross-attention are used to apply changes to the latent state. The latent state (dark blue) is always used as the query. Self-attention sources the key-value from the latent state (dark red), communicating information between the $N_L$ variables in the latent state. Cross-attention uses the observation as the key-value (light red), so that the latent state can query the observation and transfer information from it into the latent state.
  • Figure 4: Soft IoU and AUC at evaluation waypoints on the Waymo Open Motion validation split. Inclusion of contextual features (road graph + traffic signals) has a greater effect at later waypoints. Two-phase prediction specialization improves performance across the whole sequence.
  • Figure 5: Multi-modal predictions modeled by MotionPerceiver. In this example, the vehicle in the center is predicted to either continue straight or turn left at the intersection. Images are color coded: green $\rightarrow$ true positive (occupancy prediction $>0.5$), blue $\rightarrow$ false positive, red $\rightarrow$ false negative, black $\rightarrow$ rasterized road graph.
  • ...and 1 more figure
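
The color coding in the Figure 1 and Figure 5 captions and the Soft IoU metric named in Figure 4 can be made concrete with a small sketch. It assumes the common soft intersection-over-union form, in which predicted probabilities are used directly instead of thresholded masks; the function names and the choice to return 0.0 when both grids are empty are illustrative assumptions, not details taken from the paper.

```python
def soft_iou(pred, truth):
    """Soft IoU between predicted probabilities and 0/1 ground-truth cells.

    Assumed form: sum(p * t) / (sum(p) + sum(t) - sum(p * t)).
    """
    inter = sum(p * t for p, t in zip(pred, truth))
    union = sum(pred) + sum(truth) - inter
    # Returning 0.0 for an empty union is an assumption of this sketch.
    return inter / union if union > 0 else 0.0


def color_code(prob, occupied, threshold=0.5):
    """Map one grid cell to the caption's color scheme."""
    predicted = prob > threshold  # the captions use "occupancy prediction > 0.5"
    if predicted and occupied:
        return "green"  # true positive
    if predicted:
        return "blue"   # false positive
    if occupied:
        return "red"    # false negative
    return None         # true negative, left as background


# Example: four cells, two of them truly occupied.
pred = [0.9, 0.6, 0.2, 0.1]
truth = [1, 0, 1, 0]
colors = [color_code(p, bool(t)) for p, t in zip(pred, truth)]
score = soft_iou(pred, truth)
```

Because Soft IoU never thresholds the prediction, a cell predicted at 0.6 contributes less to the score than one predicted at 0.9, even though both would be drawn in the same color.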