AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction
Dubing Chen, Wencheng Han, Jin Fang, Jianbing Shen
TL;DR
The paper tackles camera-based 3D occupancy and flow prediction with a two-stage framework. Stage one trains an occupancy model that uses an adaptive forward view transformation to improve the 3D voxel representation. Stage two trains a flow model that takes sequential frames as input and uses AdaBin-based adaptive binning to handle flow magnitudes whose scale varies across scenes. The voxel feature encoder combines Lift-Splat-Shoot with semantic depth fusion and temporal fusion, and the predicted flow warps current voxel features toward future frames under supervision from future-frame ground truth. Key contributions are the AdaBin flow modeling, the use of a Ray Visible Mask to focus training, and a stronger Swin-Base setting that reaches an Occ Score of 0.453 on the nuScenes OpenOcc test set, placing 2nd without post-processing. The report also offers two practical observations: optimizing occupancy and flow separately helps, and so does concentrating training on traffic-relevant regions. Overall, the work advances camera-only 3D perception for autonomous driving by integrating adaptive depth, temporal cues, and flow-guided feature alignment.
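The flow-guided warping step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `warp_voxels`, the nearest-voxel forward warp, the averaging of colliding features, and the assumption that flow is a 2D (x, y) displacement already expressed in voxel units per frame are all choices made here for illustration.

```python
import numpy as np

def warp_voxels(feats, flow):
    """Forward-warp voxel features by a per-voxel 2D displacement.

    feats: (X, Y, Z, C) voxel features of the current frame.
    flow:  (X, Y, Z, 2) displacement along x and y, in voxel units
           (an assumption of this sketch, not taken from the paper).
    Returns an (X, Y, Z, C) array; features landing on the same voxel
    are averaged, and features leaving the grid are dropped.
    """
    X, Y, Z, C = feats.shape
    xs, ys, zs = np.meshgrid(
        np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"
    )
    # Round each target position to the nearest voxel index.
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    valid = (tx >= 0) & (tx < X) & (ty >= 0) & (ty < Y)

    out = np.zeros(feats.shape, dtype=float)
    count = np.zeros((X, Y, Z), dtype=float)
    idx = (tx[valid], ty[valid], zs[valid])
    # np.add.at accumulates correctly when several sources hit one target.
    np.add.at(out, idx, feats[valid])
    np.add.at(count, idx, 1.0)

    hit = count > 0
    out[hit] /= count[hit][:, None]
    return out

# Usage: a single active voxel moved one cell along x by a uniform flow.
feats = np.zeros((4, 4, 2, 1))
feats[1, 1, 0, 0] = 1.0
flow = np.zeros((4, 4, 2, 2))
flow[..., 0] = 1.0
warped = warp_voxels(feats, flow)
```

In a training setup of the kind described above, the warped features would be decoded and compared against the future-frame ground truth, so that errors in the predicted flow show up as misaligned features. A differentiable variant would replace the rounding with trilinear or bilinear sampling.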
Abstract
In this technical report, we present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ Dataset Challenge at CVPR 2024. Our approach is a two-stage framework that improves 3D occupancy and flow prediction through adaptive forward view transformation and flow modeling. We first train the occupancy model on its own, then train the flow model with sequential frames as input. For flow, our method combines regression with classification to address scale variations across scenes, and uses the predicted flow to warp current voxel features to future frames, supervised by future-frame ground truth. Experimental results on the nuScenes dataset show improved accuracy and robustness, demonstrating the effectiveness of our approach in real-world scenarios. Our single model based on Swin-Base ranks second on the public leaderboard, validating the potential of our method for advancing perception systems in autonomous driving.
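One way to combine classification with regression for flow, in the spirit of adaptive binning, is to classify each voxel over a set of bins whose widths adapt to the scene and then take the probability-weighted sum of bin centers. The sketch below is a hedged illustration under stated assumptions, not the report's actual head: the names `decode_flow`, `bin_logits`, and `width_logits`, the range limits `flow_min` and `flow_max`, and the single-axis formulation are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_flow(bin_logits, width_logits, flow_min=-10.0, flow_max=10.0):
    """Decode a continuous flow value from per-voxel bin probabilities.

    bin_logits:   (N, K) classification logits over K bins for N voxels.
    width_logits: (K,) per-scene logits that set the adaptive bin widths.
    flow_min, flow_max: assumed flow range (hypothetical values).
    Returns (N,) flow values along one axis.
    """
    # Normalized widths partition the range [flow_min, flow_max].
    widths = softmax(width_logits) * (flow_max - flow_min)
    edges = flow_min + np.concatenate([[0.0], np.cumsum(widths)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Classification picks bins; the weighted sum of centers gives a
    # continuous, regression-style output.
    probs = softmax(bin_logits, axis=-1)
    return probs @ centers

# Usage: four equal-width bins; one uncertain voxel, one confident voxel.
logits = np.array([[0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 50.0]])
flows = decode_flow(logits, np.zeros(4))
```

Because the bin widths are predicted per scene, a slow scene can spend most of its bins near zero while a fast scene spreads them over a wider range, which is one plausible reading of how such a head addresses scale variation.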
