
SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction

Xavier Timoneda, Markus Herb, Fabian Duerr, Daniel Goehring

TL;DR

This work proposes a self-supervised method for 3D occupancy flow estimation that eliminates the need for human-produced annotations or external flow supervision and introduces a strong self-supervised flow cue derived from features' cosine similarities.

Abstract

Estimating 3D occupancy and motion in the vehicle's surroundings is essential for autonomous driving, enabling situational awareness in dynamic environments. Existing approaches jointly learn geometry and motion but rely on expensive 3D occupancy and flow annotations, velocity labels from bounding boxes, or pretrained optical flow models. We propose a self-supervised method for 3D occupancy flow estimation that eliminates the need for human-produced annotations or external flow supervision. Our method disentangles the scene into separate static and dynamic signed distance fields and learns motion implicitly through temporal aggregation. Additionally, we introduce a strong self-supervised flow cue derived from features' cosine similarities. We demonstrate the efficacy of our 3D occupancy flow method on SemanticKITTI, KITTI-MOT, and nuScenes.

Paper Structure (5 sections, 12 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Similarity flow self-supervision on SemanticKITTI. We provide explicit flow supervision by comparing the dynamic BEV features at current time $t$ and its adjacent frames $t\!\pm\!1$ aligned by ego motion. The flow pseudo-labels $\mathbf{f}^{s}_{t^-}, \mathbf{f}^{s}_{t^+}$ are obtained from the cosine similarities between each cell in the current timestep and its $N\!\times\!N$ neighbors at $t\!\pm\!1$.
  • Figure 2: Overview of our method. a) Occupancy-Flow model. Baseline pipeline used for both training and inference. The multi-view images from time $t$ are processed by a 2D backbone, and their features are fused in the 3D encoder. Half of the encoder's attention heads produce static BEV features $\beta^{s}_{t}$, and the other half produce dynamic ones $\beta^{d}_{t}$. Both $\beta^{s}_{t}$ and $\beta^{d}_{t}$ are fused with their previous-frame features $\beta^{s}_{t-1}$, $\beta^{d}_{t-1}$, aligned by ego motion. The fused static features $\beta^{s}_{t_{\text{hist}}}$ are used to predict the static SDF $\phi^{s}_{t}$, while the fused dynamic features $\beta^{d}_{t_{\text{hist}}}$ are used to predict the dynamic SDF $\phi^{d}_{t}$ and the flow $\mathbf{f}$. b) Temporal aggregation. Static field predictions $\phi^{s}_{t}$ are aggregated with $\phi^{s}_{t-1}$ and $\phi^{s}_{t+1}$, aligned by ego motion. For the dynamic field $\phi^{d}_{t}$, the sampling positions at $\phi^{d}_{t-1}$ and $\phi^{d}_{t+1}$ are first warped by the flow $\mathbf{f}$, enabling implicit flow learning. c) Ray-based supervision of the aggregated fields $\bar{\phi}^{s}_t, \bar{\phi}^{d}_t$ using photometric $\mathcal{L}_{photo}$ and LiDAR $\mathcal{L}_{lidar}$ losses. $\mathcal{L}_{photo}$ is applied to the blended field $\phi^{b}_{t}$ using camera rays from time $t$ only. $\mathcal{L}_{lidar}$ supervises dynamic predictions $\phi^{d}_{t}$ using dynamic LiDAR rays from $t$, and static predictions $\phi^{s}_{t}$ using static LiDAR rays from the $t \pm k$ neighbors, leveraging their stationary nature. d) Similarity flow supervision using auto-generated flow pseudo-labels $\mathbf{f}^{s}_{t^-}, \mathbf{f}^{s}_{t^+}$, obtained as the $\text{argmax}$ of the cosine similarity between $\beta^{d}_{t}$ and $\beta^{d}_{t\pm1}$ over $N\!\times\!N$ neighboring cells.
  • Figure 3: 3D occupancy comparison on SemanticKITTI. We draw white boxes on areas with notable changes. Our model predicts better occupancy for small dynamic objects, and infers better geometries in occluded regions such as behind cars.
  • Figure 4: 3D occupancy flow results on KITTI-MOT. The flow legend is displayed in the top-right corner.
  • Figure 5: 3D occupancy flow results on nuScenes. Flow colors are the same as in KITTI-MOT.
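
The similarity flow cue described in Figures 1 and 2d can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `similarity_flow_pseudo_labels`, the `(C, H, W)` feature layout, the flow expressed as integer cell offsets `(dy, dx)`, and the masking of out-of-grid neighbors are all assumptions made for illustration. The sketch takes dynamic BEV features at time $t$ and at one adjacent frame (assumed to be already aligned by ego motion) and, for every cell, picks the offset within an $N\!\times\!N$ window that maximizes cosine similarity.

```python
import numpy as np


def similarity_flow_pseudo_labels(feat_t, feat_adj, n=3, eps=1e-8):
    """Hypothetical helper: per-cell flow pseudo-labels from cosine similarity.

    feat_t, feat_adj: (C, H, W) BEV feature maps, assumed ego-motion aligned.
    n: odd window size N; each cell is compared with its N x N neighbors.
    Returns (flow, score): flow is (H, W, 2) integer offsets (dy, dx) in cells,
    score is the (H, W) cosine similarity of the selected neighbor.
    """
    assert n % 2 == 1, "window size must be odd"
    r = n // 2
    _, h, w = feat_t.shape

    # L2-normalize over channels so a dot product equals cosine similarity.
    a = feat_t / (np.linalg.norm(feat_t, axis=0, keepdims=True) + eps)
    b = feat_adj / (np.linalg.norm(feat_adj, axis=0, keepdims=True) + eps)

    # Pad the adjacent frame so shifted windows keep the (H, W) shape, and
    # track which padded cells are real (assumption: ignore out-of-grid cells).
    b_pad = np.pad(b, ((0, 0), (r, r), (r, r)))
    valid_pad = np.pad(np.ones((h, w), dtype=bool), r)

    offsets = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    sims = []
    for dy, dx in offsets:
        rows = slice(r + dy, r + dy + h)
        cols = slice(r + dx, r + dx + w)
        # Similarity between cell (i, j) at t and cell (i+dy, j+dx) at t±1.
        sim = (a * b_pad[:, rows, cols]).sum(axis=0)
        sims.append(np.where(valid_pad[rows, cols], sim, -np.inf))
    sims = np.stack(sims)  # (N*N, H, W)

    best = sims.argmax(axis=0)         # index of the most similar neighbor
    flow = np.asarray(offsets)[best]   # (H, W, 2) offsets in cells
    return flow, sims.max(axis=0)
```

Calling the function once with the features at $t-1$ and once with those at $t+1$ would yield two offset maps playing the role of $\mathbf{f}^{s}_{t^-}$ and $\mathbf{f}^{s}_{t^+}$. Converting cell offsets to metric flow, and deciding how low-similarity cells are filtered, are left open here because the source does not specify them.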