OFMPNet: Deep End-to-End Model for Occupancy and Flow Prediction in Urban Environment

Youshaa Murhij, Dmitry Yudin

TL;DR

This work addresses autonomous driving motion prediction by proposing OFMPNet, an end-to-end framework that jointly predicts future occupancy grids for observed and occluded objects and their motion flow using BEV inputs, including occupancy maps and prior flow. It explores three architectures (OFMPNet-Swin, OFMPNet-R2AttU-T2, OFMPNet-ULSTM) that fuse trajectory history with scene context via Swin Transformer, cross-attention, and recurrent units, and introduces a time-weighted flow loss to improve end-point accuracy. The approach achieves state-of-the-art results on the Waymo Open Motion Dataset, reporting Soft-IoU around 0.50–0.52 and Flow-Grounded Occupancy AUC around 0.76–0.77, validating the effectiveness of dense, multi-timestep BEV predictions. A key implication is that dense occupancy/flow outputs enable robust multi-object forecasting without relying on explicit object counts, though the method currently depends on HD maps, pointing to future work toward map-free and on-device deployment for practical autonomous systems.

Abstract

The task of motion prediction is pivotal for autonomous driving systems, providing crucial data for choosing a vehicle's behavior strategy within its surroundings. Existing motion prediction techniques primarily focus on predicting the future trajectory of each agent in the scene individually, using its past trajectory data. In this paper, we introduce an end-to-end neural network methodology designed to predict the future behavior of all dynamic objects in the environment. This approach leverages the occupancy map and the scene's motion flow. We investigate various alternatives for constructing a deep encoder-decoder model called OFMPNet. This model takes a sequence of bird's-eye-view road images, an occupancy grid, and prior motion flow as input. The encoder can incorporate transformer, attention-based, or convolutional units. The decoder can use both convolutional modules and recurrent blocks. Additionally, we propose a novel time-weighted motion flow loss, whose application has shown a substantial decrease in end-point error. Our approach achieves state-of-the-art results on the Waymo Occupancy and Flow Prediction benchmark, with a Soft IoU of 52.1% and an AUC of 76.75% on Flow-Grounded Occupancy.
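To make the two quantities named in the abstract more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The Soft IoU follows the usual soft intersection-over-union form (summed products over summed probabilities minus summed products). The abstract does not give the form of the time-based weight $w_t$, so the linear ramp in `linear_time_weights`, the per-cell L2 end-point error, and every function name below are hypothetical choices made only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Grid = Sequence[Sequence[float]]                    # H x W occupancy probabilities in [0, 1]
FlowGrid = Sequence[Sequence[Tuple[float, float]]]  # H x W of (dx, dy) flow vectors


def soft_iou(pred: Grid, truth: Grid) -> float:
    """Soft IoU between a predicted and a ground-truth occupancy grid."""
    inter = sum(p * g for p_row, g_row in zip(pred, truth)
                for p, g in zip(p_row, g_row))
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    union = total - inter
    return inter / union if union > 0 else 0.0


def linear_time_weights(num_steps: int, start: float = 1.0,
                        end: float = 2.0) -> List[float]:
    """Hypothetical w_t: a linear ramp so later timesteps weigh more.

    This shape is an assumption; the paper's actual w_t is not given here.
    """
    if num_steps == 1:
        return [end]
    step = (end - start) / (num_steps - 1)
    return [start + i * step for i in range(num_steps)]


def time_weighted_flow_loss(pred_flows: Sequence[FlowGrid],
                            true_flows: Sequence[FlowGrid],
                            weights: Sequence[float]) -> float:
    """Weighted average over timesteps of the mean per-cell end-point error."""
    total = 0.0
    for pred, truth, w in zip(pred_flows, true_flows, weights):
        errors = [math.hypot(p[0] - g[0], p[1] - g[1])
                  for p_row, g_row in zip(pred, truth)
                  for p, g in zip(p_row, g_row)]
        total += w * sum(errors) / len(errors)
    return total / sum(weights)


if __name__ == "__main__":
    print(soft_iou([[0.5, 0.5]], [[1.0, 0.0]]))
    zero, off = [[(0.0, 0.0)]], [[(3.0, 4.0)]]
    w = linear_time_weights(2)
    # The same 5-cell error costs more when it occurs at the later timestep.
    print(time_weighted_flow_loss([zero, zero], [off, zero], w))
    print(time_weighted_flow_loss([zero, zero], [zero, off], w))
```

With an increasing weight, an identical flow error is penalized more at a later prediction step than at an early one, which is one plausible way a time-weighted loss could push down the final end-point error.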

Paper Structure (22 sections, 13 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Visualisation of output data for the discussed problem based on Waymo Occupancy and Flow Prediction Challenge data (Ettinger et al., 2021)
  • Figure 2: Overview of proposed OFMPNet architectures
  • Figure 3: Time-based weight $w_t$ for flow loss
  • Figure 4: Comparison between OFMPNet flow prediction results on the Waymo Open Motion (WOM) validation set
  • Figure 5: Qualitative results from Waymo Open Motion Dataset using our pretrained OFMPNet-Swin-T model