PVP: Polar Representation Boost for 3D Semantic Occupancy Prediction

Yujing Xue, Jiaxiang Liu, Jiawei Du, Joey Tianyi Zhou

TL;DR

The paper tackles dense 3D semantic occupancy prediction using polar coordinate representations, which suffer from feature distortion and non-uniform voxel distribution. It introduces the Polar Voxel Occupancy Predictor (PVP), which combines Global Representation Propagation (GRP) and Plane Decomposed Convolution (PD-Conv) to address distortion and enable effective global feature propagation in polar volumes. Using a dual-backbone architecture (a 3D PD-Conv backbone and a 2D image-to-3D backbone) with multimodal fusion, GRP-based long-range attention, and a polar-aware head that converts to Cartesian voxels, PVP achieves substantial improvements on the OpenOccupancy benchmark across input modalities, including LiDAR-only and LiDAR+image setups. The results demonstrate the viability of polar representations for dense 3D occupancy prediction and highlight their practical potential for robust autonomous-driving perception despite the distortion inherent in polar grids.

Abstract

Recently, polar coordinate-based representations have shown promise for 3D perceptual tasks. Compared to Cartesian methods, polar grids provide a viable alternative, offering better detail preservation in nearby spaces while covering larger areas. However, they face feature distortion due to non-uniform division. To address these issues, we introduce the Polar Voxel Occupancy Predictor (PVP), a novel 3D multi-modal predictor that operates in polar coordinates. PVP features two key design elements to overcome distortion: a Global Represent Propagation (GRP) module that integrates global spatial data into 3D volumes, and a Plane Decomposed Convolution (PD-Conv) that simplifies 3D distortions into 2D convolutions. These innovations enable PVP to outperform existing methods, achieving significant improvements in mIoU and IoU metrics on the OpenOccupancy dataset.
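
The "non-uniform division" mentioned in the abstract can be made concrete with a small sketch. The code below is not taken from the paper: the grid parameters (`max_range`, `n_range`, `n_azimuth`) and both function names are hypothetical, chosen only to show how the bird's-eye-view footprint of a polar cell grows with range when the range and azimuth steps are fixed.

```python
import math

def polar_cell_area(r_inner, r_outer, d_theta):
    """BEV area of one polar cell (an annular sector)."""
    return 0.5 * d_theta * (r_outer ** 2 - r_inner ** 2)

def cell_areas(max_range=50.0, n_range=100, n_azimuth=360):
    """Areas of the cells along a single azimuth slice, nearest first.

    The grid sizes are illustrative assumptions, not the paper's settings.
    """
    d_r = max_range / n_range            # uniform step in range
    d_theta = 2 * math.pi / n_azimuth    # uniform step in azimuth
    return [polar_cell_area(i * d_r, (i + 1) * d_r, d_theta)
            for i in range(n_range)]

areas = cell_areas()
print(f"nearest cell:  {areas[0]:.5f} m^2")
print(f"farthest cell: {areas[-1]:.5f} m^2")
print(f"area ratio:    {areas[-1] / areas[0]:.0f}x")
```

With uniform steps, the area of the i-th cell is proportional to 2i + 1, so in this hypothetical grid the outermost cell covers 199 times the area of the innermost one. An object of fixed physical size therefore spans many cells near the sensor and very few far away, which is one way to picture the distortion the abstract refers to.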

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An illustration of polar feature distortion: The non-uniform division of polar representation causes identical objects (e.g., cars) and scenes (e.g., lane lines) at different ranges and headings to exhibit varied distorted appearances. This leads to global misalignment between objects and scenes and increases the challenge of regression for polar-based Occupancy Prediction.
  • Figure 2: Performance Chart on 3D Occupancy Prediction: Our method PVP achieved the best results. The Intersection over Union (IoU), acting as the geometric metric to distinguish whether a voxel is occupied or empty (with all occupied voxels considered as one category), and the mean IoU (mIoU) across all classes, serving as the semantic metric, were utilized. C&L signifies that the input includes both camera images and LiDAR point clouds. L indicates that the input is exclusively LiDAR point clouds. C represents that the input contains only camera images. C&D means the input comprises both camera images and depth images.
  • Figure 3: The pipeline of PVP. Our proposed PVP consists of three components: 1) The grid-based feature extraction and fusion module, which includes a 3D Voxel-based Backbone with PD-Conv for feature extraction and a 3D Image backbone for 2D to 3D feature conversion. 2) The GRP Module utilizes attention mechanisms to capture road structures from the scene volume and accurately propagate features to their correct locations. 3) The FPN processes dense features for feature aggregation, followed by the task occupancy head. This head comprises an occupancy head and a point cloud refinement module. The primary goal of the task occupancy head is to transform the polar-based 3D tensor into Cartesian voxel output for enhanced results.
  • Figure 4: The GRP module encompasses two types of attention sub-modules: 1) Local condense attention for condensing multi-modal local features, and 2) Global decomposed attention, which leverages axial attention across three directions for enhanced feature extraction, followed by cross-attention for the recalibration of long-range feature propagation. After the GRP module, the distorted features are corrected.
  • Figure 5: A diagram illustrating PD-Conv. PD-Conv simplifies the handling of complex 3D distortions by substituting traditional 3D convolutions with three separate 2D convolutions. These convolutions perform a scale transformation on the range plane, a projection transformation on the BEV plane, and an identity transformation on the slicing plane, addressing the local distortion typical of polar volume representations.
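
The two metrics described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: it assumes voxel labels are given as flat lists of integers, that label 0 marks an empty voxel, and that mIoU is averaged over the semantic classes passed in `class_ids`. The function names and the toy labels are hypothetical, and the exact OpenOccupancy conventions (for example, which classes are ignored) may differ.

```python
def occupancy_iou(pred, target, empty_label=0):
    """Geometric IoU: every non-empty voxel counts as a single 'occupied' class."""
    inter = union = 0
    for p, t in zip(pred, target):
        p_occ, t_occ = p != empty_label, t != empty_label
        inter += p_occ and t_occ
        union += p_occ or t_occ
    return inter / union if union else float("nan")

def semantic_miou(pred, target, class_ids):
    """Semantic mIoU: per-class IoU averaged over classes that appear at all."""
    ious = []
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

# Toy example with six voxels; 0 = empty, 1 and 2 = two semantic classes.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 0, 0]
print(occupancy_iou(pred, target))          # geometric metric
print(semantic_miou(pred, target, [1, 2]))  # semantic metric
```

In the toy example, three voxels are occupied in both prediction and target out of four occupied in either, so the geometric IoU is 0.75, while the per-class IoUs of 1/2 and 1/3 average to a lower mIoU of about 0.417. This shows why a method can score well on occupancy geometry and still lag on semantics.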