AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction

Dubing Chen, Wencheng Han, Jin Fang, Jianbing Shen

TL;DR

The paper addresses camera-based 3D occupancy and flow prediction with a two-stage framework: an occupancy model is first trained with an adaptive forward view transformation to improve the 3D voxel representation, and a flow model is then trained on sequential frames, using AdaBin-based adaptive binning to predict flows across varying scales. The voxel feature encoder combines Lift-Splat-Shoot with semantic depth fusion and temporal fusion, and voxel features warped toward the future frame by the predicted flow are supervised with future ground truth. Key contributions are the AdaBin flow modeling, a Ray Visible Mask that focuses training on traffic-relevant regions, and a stronger Swin-Base setting that reaches an Occ Score of 0.453 on the nuScenes OpenOcc test set, placing 2nd without post-processing. The report also offers two practical insights: optimizing occupancy and flow separately helps, and so does concentrating supervision on traffic-relevant regions. Overall, the method combines adaptive depth, temporal cues, and flow-guided feature alignment for camera-only 3D perception in autonomous driving.

Abstract

In this technical report, we present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ Dataset Challenge at CVPR 2024. Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling. Initially, we independently train the occupancy model, followed by flow prediction using sequential frame integration. Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth. Experimental results on the nuScenes dataset demonstrate significant improvements in accuracy and robustness, showcasing the effectiveness of our approach in real-world scenarios. Our single model based on Swin-Base ranks second on the public leaderboard, validating the potential of our method in advancing autonomous car perception systems.
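
The combination of regression and classification mentioned above can be illustrated with a minimal sketch in the spirit of adaptive binning: a regression branch predicts bin widths, a classification branch predicts per-bin weights, and the flow value is the weighted sum of bin centers. This summary does not give the report's exact formulation, so the function names, the flow range of ±10, and the softmax normalization below are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(width_logits, v_min, v_max):
    """Turn predicted width logits into bin centers inside [v_min, v_max].

    Widths are normalized to sum to 1, so the bins tile the whole range.
    """
    widths = softmax(width_logits)
    centers = []
    left = 0.0  # normalized left edge of the current bin
    for w in widths:
        centers.append(v_min + (v_max - v_min) * (left + w / 2.0))
        left += w
    return centers


def flow_from_bins(width_logits, weight_logits, v_min=-10.0, v_max=10.0):
    """Hypothetical flow value for one voxel and one axis.

    width_logits: regression output that adapts the bins (assumed).
    weight_logits: classification output over the bins (assumed).
    """
    centers = bin_centers(width_logits, v_min, v_max)
    weights = softmax(weight_logits)
    return sum(w * c for w, c in zip(weights, centers))
```

Because the bin widths are predicted rather than fixed, the same number of bins can be concentrated near zero for slow scenes or spread out for fast ones, which is one plausible way to handle scale variation between scenes.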

Paper Structure (14 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: An illustration of our overall pipeline, including image backbone, view transformation from 2D to 3D, Unet for 3D encoding, and task-specific heads.
  • Figure 2: Framework of the flow head. Note that the feature of the last frame is sequentially predicted without extra computation. The voxel feature is warped to the coordinates of the next frame with the predicted flow for further supervision.
  • Figure 3: An illustration of aggregating adaptive bins and adaptive weights to flows.
  • Figure 4: Visualization of the ray visible mask. (a): Ground-truth occupancy map; (b): Traffic-critical regions of the ground-truth occupancy map.
  • Figure A.1: An illustration of our flow prediction.
  • ...and 1 more figure
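
The warping step described in the Figure 2 caption can be sketched as a forward scatter: each voxel's feature is moved to the cell its predicted flow points at. The version below is a deliberately simplified illustration, not the report's implementation. It assumes a 2D grid instead of a 3D volume, scalar features, nearest-cell rounding, no ego-motion compensation, and summation where several voxels land in the same cell. The name `warp_forward` and the `voxel_size` parameter are hypothetical.

```python
def warp_forward(features, flow, voxel_size=1.0):
    """Scatter current-frame features to next-frame cells.

    features: H x W nested list of floats (current frame).
    flow: H x W nested list of (dx, dy) displacements in metres.
    Returns an H x W grid; features leaving the grid are dropped,
    and features landing in the same cell are summed (assumption).
    """
    h, w = len(features), len(features[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Convert the metric displacement to a whole number of cells.
            nx = x + int(round(dx / voxel_size))
            ny = y + int(round(dy / voxel_size))
            if 0 <= nx < w and 0 <= ny < h:
                warped[ny][nx] += features[y][x]
    return warped


# Usage: one occupied cell moving one cell along +x.
grid = [[0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
motion = [[(0.0, 0.0)] * 3 for _ in range(3)]
motion[1][0] = (1.0, 0.0)
next_grid = warp_forward(grid, motion)
```

In training, the warped grid would then be compared against the next frame's ground truth, so that errors in the predicted flow show up as misplaced features.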