Let Occ Flow: Self-Supervised 3D Occupancy Flow Prediction

Yili Liu, Linzhan Mou, Xuan Yu, Chenrui Han, Sitong Mao, Rong Xiong, Yue Wang

TL;DR

Let Occ Flow is the first self-supervised method for joint 3D occupancy and occupancy flow prediction from camera inputs alone, eliminating the need for 3D annotations.

Abstract

Accurate perception of the dynamic environment is a fundamental task for autonomous driving and robot systems. This paper introduces Let Occ Flow, the first self-supervised work for joint 3D occupancy and occupancy flow prediction using only camera inputs, eliminating the need for 3D annotations. Utilizing a tri-perspective view (TPV) for unified scene representation and deformable attention layers for feature aggregation, our approach incorporates a novel attention-based temporal fusion module to capture dynamic object dependencies, followed by a 3D refine module for fine-grained volumetric representation. In addition, our method extends differentiable rendering to 3D volumetric flow fields, leveraging zero-shot 2D segmentation and optical flow cues for dynamic decomposition and motion optimization. Extensive experiments on the nuScenes and KITTI datasets demonstrate the competitive performance of our approach against prior state-of-the-art methods. Our project page is available at https://eliliu2233.github.io/letoccflow/

Paper Structure (31 sections, 17 equations, 11 figures, 8 tables, 1 algorithm)

This paper contains 31 sections, 17 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Let Occ Flow introduces a novel self-supervised training paradigm for 3D occupancy and occupancy flow prediction. Unlike earlier studies that relied on expensive annotations for 3D occupancy flow, our method employs readily available 2D labels to train the occupancy flow network.
  • Figure 2: The overall architecture of Let Occ Flow. We employ deformable-attention layers to integrate multi-view image input into the TPV representation. The temporal fusion module utilizes BEV-based backward-forward attention to fuse temporal feature volumes. The 3D Refine Module further aggregates spatial features and upsamples the fused volume into a high-resolution representation. We then apply two separate MLP decoders to construct volumetric SDF and flow fields, and finally perform self-supervised occupancy flow learning utilizing reprojection consistency, optical flow cues, and optional LiDAR ray supervision via differentiable rendering.
  • Figure 3: Architecture of the temporal fusion module. The module consists of ego-motion alignment and temporal feature aggregation. Ego-motion alignment is performed in voxel space to align temporal feature volumes, while BEV-based temporal feature aggregation leverages deformable attention with a backward-forward process to achieve temporal interaction.
  • Figure 4: Our effective dynamic disentanglement scheme provides an accurate dynamic mask for joint geometry and motion optimization.
  • Figure 5: Visualization results for depth estimation, 3D occupancy, and occupancy flow prediction on the KITTI dataset. Our method can predict visually appealing depth maps, fine-grained occupancy, and accurate dynamic decomposition and motion estimation.
  • ...and 6 more figures
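
The Figure 2 caption mentions supervising volumetric flow fields through differentiable rendering. As a rough illustration of that general idea, the sketch below alpha-composites per-sample depths and 3D flow vectors along a single camera ray using standard volume-rendering weights. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `compositing_weights` and `render_ray` are hypothetical, and the per-sample opacities are taken as given inputs, since the conversion from SDF values to opacity is not described in this section.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def compositing_weights(alphas: Sequence[float]) -> List[float]:
    """Standard volume-rendering weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j).

    `alphas` are per-sample opacities in [0, 1], ordered near to far.
    (Assumption: opacities are already available; how they are derived
    from the SDF field is outside the scope of this sketch.)
    """
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for alpha in alphas:
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_ray(
    alphas: Sequence[float],
    depths: Sequence[float],
    flows: Sequence[Vec3],
) -> Tuple[float, Vec3]:
    """Composite per-sample depths and 3D flow vectors into per-ray values.

    The same weights are reused for both quantities, so a rendered flow can
    be compared against a 2D cue in the same way a rendered depth can.
    """
    weights = compositing_weights(alphas)
    depth = sum(w * d for w, d in zip(weights, depths))
    flow = tuple(
        sum(w * f[k] for w, f in zip(weights, flows)) for k in range(3)
    )
    return depth, flow


if __name__ == "__main__":
    # A ray whose second sample is fully opaque: it determines the result.
    depth, flow = render_ray(
        alphas=[0.0, 1.0, 0.5],
        depths=[1.0, 2.0, 3.0],
        flows=[(0.0, 0.0, 0.0), (1.0, -2.0, 0.5), (9.0, 9.0, 9.0)],
    )
    print(depth, flow)
```

In a training setup, the rendered 3D flow would typically be projected into the image plane and compared with a 2D optical flow estimate; that projection step is omitted here because it depends on camera parameters not given in this section.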