DA-Occ: Direction-Aware 2D Convolution for Efficient and Geometry-Preserving 3D Occupancy Prediction

Yuchen Zhou, Yan Luo, Xiaogang Wang, Xingjian Gu, Mingzhou Lu

TL;DR

DA-Occ tackles real-time, geometry-preserving 3D occupancy prediction for autonomous driving with a pure 2D pipeline that retains vertical geometry through height-aware voxel slicing and Direction-Aware Convolution (DAC). The method combines a DepthNet–HeightNet-based Direction-Aware Geometric Encoder with a Lift-Splat-Shoot-inspired 2D-to-3D view transformation and a Direction-Aware Geometric Decoder that fuses height- and BEV-based features, reaching 39.3% mIoU at 27.7 FPS on Occ3D-nuScenes. Key contributions include height-aware projection, DAC for vertical and horizontal feature extraction, and a joint BEV-height fusion that preserves vertical cues while maintaining efficiency. The approach yields a favorable accuracy–efficiency balance, delivering a high RT-mIoU and demonstrating practical deployment potential for resource-constrained autonomous systems.

Abstract

Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many existing methods trade accuracy against efficiency. Some achieve high precision but with slow inference speed, while others adopt purely bird's-eye-view (BEV)-based 2D representations to accelerate processing, inevitably sacrificing vertical cues and compromising geometric integrity. To overcome these limitations, we propose a pure 2D framework that achieves efficient 3D occupancy prediction while preserving geometric integrity. Unlike conventional Lift-Splat-Shoot (LSS) methods that rely solely on depth scores to lift 2D features into 3D space, our approach additionally introduces a height-score projection to encode vertical geometric structure. We further employ direction-aware convolution to extract geometric features along both vertical and horizontal orientations, effectively balancing accuracy and computational efficiency. On the Occ3D-nuScenes benchmark, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method's applicability for real-time deployment in resource-constrained environments.
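
The difference between depth-only lifting and lifting with an additional height score can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tensor shapes, and the choice to combine the two scores as a per-pixel outer product are all hypothetical, since the abstract does not give the exact formulation.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_depth_only(feat, depth_logits):
    # Conventional LSS-style lift: each pixel feature is spread over
    # D depth bins, weighted by its depth score.
    # feat: (C, H, W), depth_logits: (D, H, W) -> (D, C, H, W)
    depth = softmax(depth_logits, axis=0)
    return np.einsum("chw,dhw->dchw", feat, depth)

def lift_depth_height(feat, depth_logits, height_logits):
    # Assumed height-aware variant: the feature is additionally spread
    # over Z height bins, weighted by a per-pixel height score.
    # height_logits: (Z, H, W) -> output (Z, D, C, H, W)
    depth = softmax(depth_logits, axis=0)
    height = softmax(height_logits, axis=0)
    return np.einsum("chw,dhw,zhw->zdchw", feat, depth, height)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, D, Z, H, W = 4, 6, 3, 2, 5
feat = rng.normal(size=(C, H, W))
depth_logits = rng.normal(size=(D, H, W))
height_logits = rng.normal(size=(Z, H, W))

depth_only = lift_depth_only(feat, depth_logits)
depth_height = lift_depth_height(feat, depth_logits, height_logits)
```

Because the height scores sum to one at every pixel, summing `depth_height` over the height axis recovers `depth_only` in this sketch. The height-aware lift therefore carries the same information as the depth-only lift plus a distribution over vertical slices.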

Paper Structure

This paper contains 17 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Inference speed (FPS) and accuracy (mIoU) of various methods on the Occ3D-nuScenes benchmark [tian2023occ3d]. Following the definition proposed in [hou2024fastocc], we consider an occupancy prediction method to be real-time if it achieves at least 10 FPS.
  • Figure 2: The top section illustrates how traditional 2D BEV methods cause geometric structures to collapse during compression, particularly in the vertical direction. The bottom section shows how our approach preserves these structures while maintaining efficiency, even in a purely 2D framework.
  • Figure 3: Overall architecture of DA-Occ. On the left, the input images are processed by the Backbone, generating feature maps $\mathbf{F}_n$ that are fed into the DepthNet and HeightNet for depth and height predictions. These features are then used to construct 3D features $\mathbf{F}_{3D}$ (with height) and BEV features $\mathbf{F}_{bev}$ (without height). The DAC operators ($\mathcal{D}_{v}(\cdot)$ and $\mathcal{D}_{h}(\cdot)$) are applied to enhance the feature representation. Finally, these features are fused to produce the final output, which is visualized on the right. (To facilitate understanding, some feature maps are shown as the original images.)
  • Figure 4: Internal operations of Direction-Aware 2D Convolution. The input feature $\mathbf{F}_{in}$ is first compressed by averaging along the horizontal or vertical direction, yielding either $\mathbf{F}_h$ or $\mathbf{F}_v$. This intermediate result is passed through a Multi-Layer Perceptron (MLP) to generate dynamic convolutional weights. These weights are then applied to a concatenated feature tensor via convolution, producing the final output feature $\mathbf{F}_{out}$.
  • Figure 5: Combined effects of $\mathcal{D}_{v}$ in $\mathbf{F}_{height}$, and of $\mathcal{D}_{v}$ and $\mathcal{D}_{h}$ in $\mathbf{F}_{bev}$. The right side presents an equivalent depiction of these effects, highlighting the collaborative extraction of geometric features along the $X$, $Y$, and $Z$ axes and the synergy between the three axes in capturing spatial information.
  • ...and 1 more figures
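
The three steps named in the Figure 4 caption (directional average compression, an MLP that emits dynamic weights, and a convolution using those weights) can be sketched as follows. This is a simplified illustration, not the paper's operator. It assumes that "vertical" means averaging over the width to keep a per-row profile, that the MLP has two layers with a ReLU, and that the dynamic weights form one 1D kernel per channel. It also omits the concatenation step mentioned in the caption, whose inputs are not specified there. All function and parameter names are hypothetical.

```python
import numpy as np

def depthwise_conv1d(x, kernels, axis):
    # Slide a per-channel 1D kernel along one spatial axis with zero
    # padding, so the output keeps the input shape.
    # x: (C, H, W); kernels: (C, K) with K odd; axis: 1 (H) or 2 (W).
    K = kernels.shape[1]
    pad = K // 2
    pad_width = [(0, 0)] * 3
    pad_width[axis] = (pad, pad)
    xp = np.pad(x, pad_width)
    n = x.shape[axis]
    out = np.zeros_like(x)
    for k in range(K):
        window = [slice(None)] * 3
        window[axis] = slice(k, k + n)
        out += kernels[:, k][:, None, None] * xp[tuple(window)]
    return out

def direction_aware_conv(f_in, w1, w2, direction):
    # Step 1: directional average compression (assumed convention:
    # "vertical" averages over width and keeps the per-row profile).
    conv_axis, pooled_axis = (1, 2) if direction == "vertical" else (2, 1)
    profile = f_in.mean(axis=pooled_axis)      # (C, H) or (C, W)
    # Step 2: a small MLP maps each channel profile to a K-tap kernel.
    hidden = np.maximum(profile @ w1, 0.0)     # (C, hidden)
    kernels = hidden @ w2                      # (C, K)
    # Step 3: apply the input-dependent kernels along the same direction.
    return depthwise_conv1d(f_in, kernels, conv_axis)

# Toy sizes and random MLP weights, chosen only for illustration.
rng = np.random.default_rng(0)
C, H, W, hidden, K = 4, 6, 8, 16, 3
f_in = rng.normal(size=(C, H, W))
w2 = rng.normal(size=(hidden, K))
f_out_v = direction_aware_conv(f_in, rng.normal(size=(H, hidden)), w2, "vertical")
f_out_h = direction_aware_conv(f_in, rng.normal(size=(W, hidden)), w2, "horizontal")
```

Generating the kernel from a pooled profile makes the filter depend on the input along one direction at a time, which is one way a purely 2D operator could react to vertical or horizontal structure without a 3D convolution.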