StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation

Yining Shi, Kun Jiang, Ke Wang, Jiusi Li, Yunlong Wang, Mengmeng Yang, Diange Yang

TL;DR

StreamingFlow predicts occupancy flow in driving scenes from asynchronous multi-modal sensor streams. It introduces SpatialGRU-ODE, a Neural-ODE–augmented GRU that learns the derivatives of BEV features and alternates update and predict steps, so the BEV state can be propagated to arbitrary future timestamps in continuous time. This supports asynchronous fusion and streaming predictions at varying horizons, and the method reports state-of-the-art results on nuScenes and Lyft L5 together with zero-shot generalization to prediction horizons not seen in training. The approach reduces synchronization constraints and latency, offering a practical path toward real-time, fine-grained environment awareness for autonomous systems.

Abstract

Predicting the future occupancy states of the surrounding environment is a vital task for autonomous driving. However, the current best-performing single-modality and multi-modality fusion perception methods can only predict uniform snapshots of future occupancy states, and they require strictly synchronized sensory data for sensor fusion. We propose a novel framework, StreamingFlow, to lift these strong limitations. StreamingFlow is a BEV occupancy predictor that ingests asynchronous multi-sensor data streams for fusion and performs streaming forecasting of the future occupancy map at any future timestamp. By integrating neural ordinary differential equations (N-ODE) into recurrent neural networks, StreamingFlow learns derivatives of BEV features over temporal horizons, updates the implicit sensor's BEV features as part of the fusion process, and propagates BEV states to the desired future time point. It shows good zero-shot generalization in prediction, reflected both in interpolation within the observed prediction time horizon and in reasonable inference over an unseen, farther future period. Extensive experiments on two large-scale datasets, nuScenes and Lyft L5, demonstrate that StreamingFlow significantly outperforms previous vision-based and LiDAR-based methods, and shows superior performance compared to state-of-the-art fusion-based methods.
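
The update/predict cycle described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a GRU-ODE-style derivative dh/dt = (1 - z) * (g - h), an explicit Euler solver, hand-picked scalar weights instead of learned convolutions, and a 2-element vector standing in for a BEV feature map. The names `ode_derivative`, `predict`, `update`, and `streaming_forecast` are hypothetical and chosen only for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ode_derivative(h):
    """Assumed GRU-ODE-style derivative dh/dt = (1 - z) * (g - h), per element.

    The weights (0.5, 0.8, 0.1) are made-up constants, not learned values.
    """
    out = []
    for v in h:
        z = sigmoid(0.5 * v)            # update gate
        g = math.tanh(0.8 * v + 0.1)    # candidate state
        out.append((1.0 - z) * (g - v))
    return out


def predict(h, t_from, t_to, max_step=0.05):
    """Propagate the hidden state from t_from to t_to with explicit Euler steps."""
    span = t_to - t_from
    if span <= 0:
        return list(h)
    n_steps = max(1, math.ceil(span / max_step))
    dt = span / n_steps
    for _ in range(n_steps):
        h = [v + dt * d for v, d in zip(h, ode_derivative(h))]
    return h


def update(h, feature):
    """Discrete GRU-style measurement update with one sensor's feature."""
    out = []
    for v, obs in zip(h, feature):
        z = sigmoid(0.5 * v + 0.5 * obs)
        g = math.tanh(0.8 * v + 1.2 * obs)
        out.append(z * v + (1.0 - z) * g)
    return out


def streaming_forecast(observations, query_times, dim=2):
    """Fuse timestamped observations in time order, then forecast.

    observations: non-empty list of (timestamp, sensor_name, feature) tuples.
    query_times:  future timestamps, not necessarily evenly spaced.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    h, t = [0.0] * dim, None
    for ts, _sensor, feature in sorted(observations, key=lambda o: o[0]):
        if t is not None:
            h = predict(h, t, ts)   # bridge the gap between sensor timestamps
        h = update(h, feature)      # fold in whichever sensor arrived
        t = ts
    forecasts = {}
    for q in sorted(query_times):
        h = predict(h, t, q)        # roll the state forward to the query time
        t = q
        forecasts[q] = h
    return forecasts


# Camera and LiDAR frames arrive at different, unaligned timestamps (seconds).
obs = [
    (0.00, "camera", [0.2, -0.1]),
    (0.05, "lidar", [0.3, 0.0]),
    (0.50, "camera", [0.4, 0.1]),
    (0.55, "lidar", [0.5, 0.2]),
]
for when, state in streaming_forecast(obs, [0.8, 1.3, 2.55]).items():
    print(when, state)
```

Because the state is advanced by integrating a derivative rather than by a fixed-interval recurrence, the query timestamps do not need to match the spacing of the inputs, which is the property the abstract calls streaming forecasting.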

Paper Structure

This paper contains 27 sections, 8 equations, 7 figures, 11 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Comparison between conventional synchronized BEV fusion (top) and our asynchronous BEV fusion (bottom). We formulate fusion with an update-predict-update approach.
  • Figure 2: The framework of StreamingFlow. Raw data streams from each sensor are encoded into BEV features. The SpatialGRU-ODE process operates on the timeline in two stages split by the present timestamp: asynchronous multi-sensor deep feature fusion via the SpatialGRU-ODE update process, and continuous occupancy flow prediction via the SpatialGRU-ODE predict process.
  • Figure 3: Illustration of the measurement update process of SpatialGRU-ODE in temporal-agnostic fusion.
  • Figure 4: Illustration of streaming prediction process of SpatialGRU-ODE.
  • Figure 5: Visualization of StreamingFlow in diverse driving scenarios. Different colors represent different agent instances, and lighter colors represent the agents' future occupancy. (top): samples from the Lyft dataset, highway (left) and urban (right). (middle and bottom): samples from the nuScenes dataset, sunny (middle left), overcast (middle right), rainy (bottom left), and night (bottom right). StreamingFlow works well in all of these challenging driving scenarios.
  • ...and 2 more figures