Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion

Jongseong Bae, Junwoo Ha, Ha Young Kim

TL;DR

Camera-based SSC often underestimates distant geometry due to perspective and occlusion. The authors propose ScanSSC, which refines distant voxels by propagating near-viewpoint context through an axis-wise masked self-attention Scan Module and a Scan Loss that uses cumulatively averaged logits to transfer information along depth, width, and height axes. The approach yields state-of-the-art results on SemanticKITTI and SSCBench-KITTI-360, with clear ablation-supported gains from both components and their synergy. This work addresses distance-dependent completion imbalance and offers a practical, robust mechanism for improving 3D scene completion in autonomous driving contexts. Overall, ScanSSC demonstrates that targeted near-to-far refinements can substantially boost distant geometry reconstruction while maintaining efficiency.

Abstract

Camera-based Semantic Scene Completion (SSC) is gaining attention in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of the geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and Scan Loss, both designed to enhance distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, with each axis employing a near-to-far cascade masking that enables distant voxels to capture relationships with preceding voxels. In addition, the Scan Loss computes the cross-entropy along each axis between cumulative logits and corresponding class distributions in a near-to-far direction, thereby propagating rich context-aware signals to distant voxels. Leveraging the synergy between these components, ScanSSC achieves state-of-the-art performance, with IoUs of 44.54 and 48.29, and mIoUs of 17.40 and 20.14 on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, respectively.
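
The near-to-far cascade masking described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that index 0 is the voxel nearest the viewpoint, that each voxel may attend to itself as well as to preceding voxels, and that queries, keys, and values are the raw features with no learned projections. The function names `near_to_far_mask` and `masked_self_attention` are hypothetical.

```python
import math

def near_to_far_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Assumption: index 0 is nearest to the viewpoint, and a voxel may
    attend to itself and to every voxel that precedes it (j <= i).
    """
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_self_attention(features, mask):
    """Single-head self-attention over one line of voxels along an axis.

    Assumption: queries, keys, and values are the raw feature vectors;
    a real model would apply learned projections first.
    """
    n = len(features)
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for i in range(n):
        # Scaled dot-product scores; masked positions get -inf.
        scores = [
            sum(q * k for q, k in zip(features[i], features[j])) * scale
            if mask[i][j] else float("-inf")
            for j in range(n)
        ]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(weights[j] * features[j][d] for j in range(n))
            for d in range(dim)
        ])
    return outputs

line = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = masked_self_attention(line, near_to_far_mask(len(line)))
```

Under these assumptions the nearest voxel attends only to itself, so its output equals its input, and changing a far voxel's feature leaves the outputs of all nearer voxels unchanged. Applying the same routine along depth, width, and height separately would give one output per axis, in the spirit of the three parallel blocks the paper describes.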

Paper Structure

This paper contains 21 sections, 10 equations, 12 figures, 11 tables, 2 algorithms.

Figures (12)

  • Figure 1: In camera-based SSC, the projected distant geometry is sparse and unrealistic due to factors such as perspective and occlusion. Our ScanSSC addresses this challenge by relating distant geometry to the context of the more accurate near-viewpoint geometry. As a result, ScanSSC achieves more accurate reconstructions of both distant and near scenes, outperforming existing camera-based SSC methods such as VoxFormer-T li2023voxformer and CGFormer yu2024contextgeometryawarevoxel.
  • Figure 2: Axis-wise trends in recall, IoU, and mIoU for VoxFormer li2023voxformer, along with the ground-truth occupied voxel distributions, on the SemanticKITTI behley2019semantickitti validation data. All values are plotted as graphs scaled between 0 and 1. Each graph is binned with sizes of 256, 256, and 32 for the depth, width, and height axes, respectively. (1) to (4) denote the four segments obtained when each axis is divided into four equal parts.
  • Figure 3: The overall architecture of the proposed ScanSSC. After ${\textbf{F}}^{3D}$ is obtained through the viewing transformation, it is passed through the three parallel Scan blocks of the Scan Module. Each block performs masked self-attention along the axis highlighted in red. The purple arrows indicate the 'near-to-far' direction, implemented by the corresponding mask below. ${\textbf{Q}}_{axis}$, ${\textbf{K}}_{axis}$, $\textbf{V}_{axis}$, and $\textbf{Z}_{axis}$ denote the query, key, value, and output features of attention, respectively, where ${axis} \in \{{dep}, {wid}, {hgt}\}$.
  • Figure 4: Visual overview of the axis-wise cumulative voxel averaging in Scan Loss. Each voxel represents a predicted class logit.
  • Figure 5: Axis-wise trends in mIoU for VoxFormer li2023voxformer, CGFormer yu2024contextgeometryawarevoxel, and ScanSSC on the SemanticKITTI behley2019semantickitti validation data. Each graph is binned with sizes of 256, 256, and 32 for the depth, width, and height axes, respectively.
  • ...and 7 more figures
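
The axis-wise cumulative averaging named in the Figure 4 caption, combined with the abstract's description of Scan Loss, can be sketched for a single line of voxels along one axis. This is a minimal sketch under assumptions, not the paper's formulation: it assumes the cumulative logit at position i is the mean of the logits at positions 0 through i (near to far), that the target is the matching running mean of one-hot labels, and that the per-position cross-entropies are averaged. The names `cumulative_mean`, `log_softmax`, and `scan_loss_1d` are hypothetical.

```python
import math

def cumulative_mean(vectors):
    """Running mean of equal-length vectors, accumulated near to far."""
    running = [0.0] * len(vectors[0])
    means = []
    for count, vec in enumerate(vectors, start=1):
        running = [r + v for r, v in zip(running, vec)]
        means.append([r / count for r in running])
    return means

def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_norm for x in logits]

def scan_loss_1d(logits, labels, num_classes):
    """Cross-entropy between cumulatively averaged logits and the
    cumulative class distribution along one line of voxels.

    Assumption: targets are running means of one-hot labels, and the
    per-position losses are averaged over the line.
    """
    one_hot = [
        [1.0 if c == y else 0.0 for c in range(num_classes)]
        for y in labels
    ]
    avg_logits = cumulative_mean(logits)
    target_dist = cumulative_mean(one_hot)
    losses = [
        -sum(t * lp for t, lp in zip(target, log_softmax(avg)))
        for avg, target in zip(avg_logits, target_dist)
    ]
    return sum(losses) / len(losses)

# Two voxels, two classes, uninformative logits.
loss = scan_loss_1d([[0.0, 0.0], [0.0, 0.0]], labels=[0, 1], num_classes=2)
```

Because each cumulative term at a far position contains the logits of all nearer positions, a gradient taken through this loss would couple distant predictions to near ones, which matches the near-to-far propagation the abstract describes. A full version would repeat the computation along the depth, width, and height axes of the 3D logit volume.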