Towards Temporal Fusion Beyond the Field of View for Camera-based Semantic Scene Completion

Jongseong Bae, Junwoo Ha, Jinnyeong Heo, Yeongin Lee, Ha Young Kim

TL;DR

This work tackles the challenge of completing 3D scenes beyond the current camera view in camera-based semantic scene completion (SSC). It introduces C3DFusion, a temporal geometry fusion module that directly aligns and fuses 3D-lifted point features from current and past frames in the current frame's metric space. Two complementary techniques, historical context blurring and current-centric feature densification, reduce noise from past frames and emphasize current information. The approach yields state-of-the-art results on SemanticKITTI and SSCBench-KITTI-360 and generalizes well to other SSC architectures, with notable improvements in out-of-view regions. By enabling robust out-of-frame completion with an efficient, generalizable design, C3DFusion has strong potential to enhance perception reliability in autonomous driving and related 3D perception tasks.

Abstract

Recent camera-based 3D semantic scene completion (SSC) methods have increasingly explored leveraging temporal cues to enrich the features of the current frame. However, while these approaches primarily focus on enhancing in-frame regions, they often struggle to reconstruct critical out-of-frame areas near the sides of the ego-vehicle, although previous frames commonly contain valuable contextual information about these unseen regions. To address this limitation, we propose the Current-Centric Contextual 3D Fusion (C3DFusion) module, which generates hidden region-aware 3D feature geometry by explicitly aligning 3D-lifted point features from both current and historical frames. C3DFusion performs enhanced temporal fusion through two complementary techniques: historical context blurring, which suppresses noise from inaccurately warped historical point features by attenuating their scale, and current-centric feature densification, which enhances current point features by increasing their volumetric contribution. Simply integrated into standard SSC architectures, C3DFusion demonstrates strong effectiveness, significantly outperforming state-of-the-art methods on the SemanticKITTI and SSCBench-KITTI-360 datasets. Furthermore, it exhibits robust generalization, achieving notable performance gains when applied to other baseline models.
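
The fusion idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the rigid 4x4 pose `T_cur_from_hist`, the scalar `hist_weight` used as a stand-in for attenuating historical features, and the sum-pooled voxel grid are all assumptions made for illustration. The abstract does not specify how any of these steps are actually realized.

```python
import numpy as np

def warp_to_current(points_hist, T_cur_from_hist):
    """Map (N, 3) points from a past frame into the current frame's metric space."""
    homo = np.concatenate([points_hist, np.ones((len(points_hist), 1))], axis=1)
    return (homo @ T_cur_from_hist.T)[:, :3]

def voxelize_sum(points, feats, origin, voxel_size, grid_shape):
    """Sum point features per voxel; points outside the grid are dropped."""
    idx = np.floor((points - np.asarray(origin)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    flat = np.ravel_multi_index(tuple(idx[inside].T), grid_shape)
    grid = np.zeros((int(np.prod(grid_shape)), feats.shape[1]))
    np.add.at(grid, flat, feats[inside])  # unbuffered scatter-add
    return grid.reshape(*grid_shape, feats.shape[1])

def fuse_frames(points_cur, feats_cur, points_hist, feats_hist,
                T_cur_from_hist, hist_weight, origin, voxel_size, grid_shape):
    """Align historical points to the current frame, down-weight them, and voxelize."""
    points_hist_cur = warp_to_current(points_hist, T_cur_from_hist)
    points = np.concatenate([points_cur, points_hist_cur], axis=0)
    # Assumption: attenuation is modelled as a scalar weight below 1.
    feats = np.concatenate([feats_cur, hist_weight * feats_hist], axis=0)
    return voxelize_sum(points, feats, origin, voxel_size, grid_shape)

# Toy example: the past frame is offset by 1 m along x relative to the current one.
T = np.eye(4)
T[0, 3] = 1.0
grid = fuse_frames(
    points_cur=np.array([[0.5, 0.5, 0.5]]), feats_cur=np.array([[1.0]]),
    points_hist=np.array([[0.5, 0.5, 0.5]]), feats_hist=np.array([[1.0]]),
    T_cur_from_hist=T, hist_weight=0.5,
    origin=(0.0, 0.0, 0.0), voxel_size=1.0, grid_shape=(2, 1, 1),
)
```

In this toy case the historical point lands in a voxel that the current frame does not cover, which is the situation the abstract describes for out-of-frame regions. Sum pooling is chosen here so that a denser set of current points contributes more to the grid; other pooling choices are equally plausible.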

Paper Structure

This paper contains 44 sections, 10 equations, 6 figures, 15 tables, 2 algorithms.

Figures (6)

  • Figure 1: Existing temporal fusion models struggle to complete out-of-frame geometry in the current frame. For example, HTCL-S (Li et al., 2024), a recent method that performs temporal fusion via 2D feature warping, fails to recover the car on the left side despite its visibility in previous frames, resulting in performance comparable to that of the single-frame-based CGFormer (Yu et al., 2024).
  • Figure 2: An overview of our model, highlighting the proposed C3DFusion. The symbol '$\oplus$' denotes feature concatenation.
  • Figure 3: Visual comparison of our model against other recent camera-based methods on the SemanticKITTI validation set.
  • Figure A.1: Visual comparison between our model with C3DFusion and the baseline using temporal LSS fusion.
  • Figure A.2: Visual comparison of backprojected 3D point clouds of the current frame with varying interpolation factors for current-centric feature densification.
  • ...and 1 more figure
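
The caption of Figure A.2 mentions backprojecting the current frame with varying interpolation factors. One way to picture this is the sketch below, which samples a depth map on a finer pixel grid before lifting it to 3D, so that a factor of `f` yields roughly `f * f` times as many points. The bilinear sampling scheme, the pinhole intrinsics matrix `K`, and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def bilinear_sample(img, v, u):
    """Bilinearly sample a 2D array (at least 2x2) at fractional (v, u) coordinates."""
    v0 = np.clip(np.floor(v).astype(int), 0, img.shape[0] - 2)
    u0 = np.clip(np.floor(u).astype(int), 0, img.shape[1] - 2)
    dv, du = v - v0, u - u0
    return (img[v0, u0] * (1 - dv) * (1 - du)
            + img[v0, u0 + 1] * (1 - dv) * du
            + img[v0 + 1, u0] * dv * (1 - du)
            + img[v0 + 1, u0 + 1] * dv * du)

def densify_backproject(depth, K, factor):
    """Backproject a depth map sampled on a grid that is `factor` times finer."""
    h, w = depth.shape
    v = np.linspace(0, h - 1, h * factor)
    u = np.linspace(0, w - 1, w * factor)
    vv, uu = np.meshgrid(v, u, indexing="ij")
    z = bilinear_sample(depth, vv, uu)
    # Pinhole model: pixel (u, v) at depth z maps to camera coordinates (x, y, z).
    x = (uu - K[0, 2]) * z / K[0, 0]
    y = (vv - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example: a flat surface 2 m away, seen by a hypothetical 3x2-pixel camera.
depth = np.full((2, 3), 2.0)
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
sparse = densify_backproject(depth, K, factor=1)
dense = densify_backproject(depth, K, factor=2)
```

Interpolating depth across object boundaries can create points that float between surfaces, so a real pipeline would likely need to handle depth discontinuities; this sketch ignores that.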