CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, Hang Zhao

TL;DR

The paper tackles the challenge of predicting 3D semantic occupancy from monocular vision, where depth ambiguity limits accuracy. It introduces CVT-Occ, built around a Cost Volume Temporal (CVT) module that samples points along each voxel's line of sight and aggregates their features from $K-1$ historical frames to form a 3D cost volume $\mathbf{F}$, which is then used to refine the current voxel features. The method integrates with a BEV-to-volume pipeline and an occupancy decoder, and is trained with a multi-task loss $\mathcal{L} = \mathcal{L}_{\text{occ}} + \lambda \mathcal{L}_{\text{cvt}}$ that combines cross-entropy for semantics with a binary cross-entropy on the cost volume weights. On Occ3D-Waymo, CVT-Occ achieves state-of-the-art mIoU at modest additional cost, supporting the claim that explicit temporal parallax in 3D space can substantially improve visual 3D perception for autonomous driving.
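The multi-task loss above can be sketched in a few lines of NumPy. This is only an illustration of $\mathcal{L} = \mathcal{L}_{\text{occ}} + \lambda \mathcal{L}_{\text{cvt}}$, not the authors' implementation: the function names, the default `lam`, and the choice of a binary occupied/free mask as the target for the cost volume weights are assumptions made for the example.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy. logits: (M, num_classes); labels: (M,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(labels.shape[0]), labels].mean()

def binary_cross_entropy(weights, targets, eps=1e-7):
    """Mean BCE. weights: (M,) values in (0, 1); targets: (M,) values in {0, 1}."""
    w = np.clip(weights, eps, 1.0 - eps)  # avoid log(0)
    return -(targets * np.log(w) + (1.0 - targets) * np.log(1.0 - w)).mean()

def total_loss(occ_logits, occ_labels, cvt_weights, occupied_mask, lam=1.0):
    """L = L_occ + lam * L_cvt over M flattened voxels.

    Assumption: the BCE target is a binary occupied/free mask; the summary
    only says the BCE is applied to the cost volume weights.
    """
    l_occ = cross_entropy(occ_logits, occ_labels)
    l_cvt = binary_cross_entropy(cvt_weights, occupied_mask)
    return l_occ + lam * l_cvt
```

Here `lam` plays the role of $\lambda$ and trades off the semantic term against the supervision on the weights; its value in the paper is not stated in this summary.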

Abstract

Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at \url{https://github.com/Tsinghua-MARS-Lab/CVT-Occ}.
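To make the line-of-sight sampling concrete, the following NumPy sketch samples points on the ray through a voxel center and maps them into a historical ego frame. It rests on several assumptions that are not taken from the paper: the sensor origin is known, samples are spread evenly between `1 - span` and `1 + span` times the voxel's range, and the relative pose is given as a 4x4 homogeneous matrix named `T_hist_from_cur` (a hypothetical name).

```python
import numpy as np

def sample_along_ray(voxel_center, origin, num_samples, span=0.2):
    """Return (num_samples, 3) points on the ray from origin through voxel_center.

    Assumption: distances are spread evenly over [1 - span, 1 + span] times
    the voxel's own range, so the samples bracket the voxel center.
    """
    scales = np.linspace(1.0 - span, 1.0 + span, num_samples)
    direction = voxel_center - origin
    return origin + scales[:, None] * direction

def to_historical_frame(points, T_hist_from_cur):
    """Apply a 4x4 rigid transform (current ego frame -> historical ego frame)."""
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.concatenate([points, ones], axis=1)  # (N, 4)
    return (homogeneous @ T_hist_from_cur.T)[:, :3]
```

Because the samples lie at different depths on the same ray, they land at different locations in each historical frame once the ego vehicle has moved, which is the parallax cue the abstract refers to. Features would then be read from the historical volumes at the transformed locations.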

Paper Structure

This paper contains 20 sections, 6 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Comparison of Temporal Fusion Methods. Illustrated are four key approaches: (1) Temporal Self-Attention [li2022bevformer], leveraging attention mechanisms for temporal integration; (2) Warp and Concat [huang2022bevdet4d, yang2022bevformerv2, wang2023panooccunifiedoccupancyrepresentation], combining features across frames and fusing them through convolution; (3) Cost Volume Construction in image space [solofusion], constructing a cost volume from the image inputs of different frames and leveraging plane-sweep volumes for depth map generation; and (4) Our Proposed Method, which constructs a temporal cost volume in 3D space to enhance feature refinement. In the figure, Ⓐ and $\otimes$ represent coordinate alignment and element-wise product, respectively.
  • Figure 2: Overall Architecture of CVT-Occ. The image backbone extracts multi-scale features from multi-view images, which are transformed into 3D volume features denoted as $\mathbf{V} \in \mathbb{R}^{H \times W \times Z \times C}$. The Cost Volume Temporal Module samples points along the line of sight within the current volume and projects them onto $K-1$ historical frames, resulting in $K \times N$ 3D volume features. These features are concatenated to construct cost volume features $\mathbf{F} \in \mathbb{R}^{H \times W \times Z \times (K\times N) \times C}$. Convolution layers are then applied to generate weights $\mathbf{W} \in \mathbb{R}^{H \times W \times Z}$, which refine the depth of each 3D voxel. Finally, an occupancy decoder produces 3D semantic occupancy predictions. In the figure, Ⓐ, Ⓒ, and $\otimes$ represent coordinate alignment, concatenation, and element-wise product, respectively.
  • Figure 2: Ablation Experiments. We evaluate CVT-Occ under different frame specifications and CVT supervision.
  • Figure 3: Experiments of Different Time Spans.
  • Figure 4: Qualitative Results. CVT-Occ exhibits superior performance in predicting the occupancy of vegetation and buildings.
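
The tensor shapes given in the Overall Architecture caption can be traced with a short NumPy sketch. It is a shape-level illustration under stated assumptions, not the released code: a single linear projection over the flattened $(K \times N) \times C$ axis stands in for the convolution layers, a sigmoid is assumed to map the result to weights in $(0, 1)$, and all function names are hypothetical.

```python
import numpy as np

def build_cost_volume(sampled_per_frame):
    """Concatenate K arrays of shape (H, W, Z, N, C) into F: (H, W, Z, K*N, C)."""
    return np.concatenate(sampled_per_frame, axis=3)

def predict_weights(F, kernel, bias=0.0):
    """Map F to weights of shape (H, W, Z).

    Assumption: one linear projection per voxel (kernel of shape (K*N*C,))
    followed by a sigmoid replaces the convolution layers of the caption.
    """
    H, W, Z, KN, C = F.shape
    flat = F.reshape(H, W, Z, KN * C)
    logits = flat @ kernel + bias
    return 1.0 / (1.0 + np.exp(-logits))

def refine_volume(V, weights):
    """Element-wise product of V (H, W, Z, C) with weights broadcast over C."""
    return V * weights[..., None]
```

For example, with $K = 3$ frames and $N = 2$ samples per ray, three arrays of shape `(H, W, Z, 2, C)` yield a cost volume of shape `(H, W, Z, 6, C)`, and the refined volume keeps the shape of $\mathbf{V}$ so it can be passed to the occupancy decoder unchanged.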