
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion

Bohan Li, Jiajun Deng, Wenyao Zhang, Zhujin Liang, Dalong Du, Xin Jin, Wenjun Zeng

TL;DR

HTCL introduces pattern affinity with scale-aware isolation and multiple independent learners for fine-grained contextual correspondence modeling, and adaptively refines the feature sampling locations based on initially identified high-affinity locations and their neighboring relevant regions.

Abstract

Camera-based 3D semantic scene completion (SSC) is pivotal for predicting complicated 3D layouts with limited 2D image observations. The existing mainstream solutions generally leverage temporal information by roughly stacking history frames to supplement the current frame. Such straightforward temporal modeling inevitably diminishes valid clues and increases learning difficulty. To address this problem, we present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion. The primary innovation of this work involves decomposing temporal context learning into two hierarchical steps: (a) cross-frame affinity measurement and (b) affinity-based dynamic refinement. First, to separate critical relevant context from redundant information, we introduce the pattern affinity with scale-aware isolation and multiple independent learners for fine-grained contextual correspondence modeling. Subsequently, to dynamically compensate for incomplete observations, we adaptively refine the feature sampling locations based on initially identified locations with high affinity and their neighboring relevant regions. Our method ranks $1^{st}$ on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU on the OpenOccupancy benchmark. Our code is available at https://github.com/Arlo0o/HTCL.
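The two hierarchical steps named in the abstract can be illustrated with a toy sketch. This is not the HTCL implementation: it assumes plain cosine similarity as the affinity measure (the baseline named in the Figure 4 caption, not the paper's pattern affinity with scale-aware isolation and multiple learners), and it replaces the adaptive refinement with a fixed neighborhood around the highest-affinity locations. The function names `cosine_affinity`, `refine_locations`, and `aggregate` and the parameters `k` and `radius` are hypothetical and chosen only for illustration.

```python
import math


def cosine_affinity(cur, hist, eps=1e-8):
    """Step (a), simplified: per-location cosine similarity between the
    current-frame features and the (already aligned) history-frame features.
    cur and hist are equal-length lists of feature vectors."""
    scores = []
    for f_c, f_h in zip(cur, hist):
        dot = sum(a * b for a, b in zip(f_c, f_h))
        norm = math.sqrt(sum(a * a for a in f_c)) * math.sqrt(sum(b * b for b in f_h))
        scores.append(dot / (norm + eps))
    return scores


def refine_locations(affinity, k=1, radius=1):
    """Step (b), simplified: start from the k highest-affinity locations and
    extend the sampling set to their neighbors within `radius`.
    Assumption: a fixed 1D window stands in for learned sampling refinement."""
    top = sorted(range(len(affinity)), key=lambda i: affinity[i], reverse=True)[:k]
    picked = set()
    for i in top:
        for j in range(i - radius, i + radius + 1):
            if 0 <= j < len(affinity):
                picked.add(j)
    return sorted(picked)


def aggregate(cur, hist, affinity):
    """Affinity-weighted fusion: history features contribute in proportion to
    their affinity; negative affinity is clamped to zero and thus ignored."""
    fused = []
    for f_c, f_h, a in zip(cur, hist, affinity):
        w = max(a, 0.0)
        fused.append([(c + w * h) / (1.0 + w) for c, h in zip(f_c, f_h)])
    return fused


# Three locations with 2-channel features.
cur = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hist = [[1.0, 0.0], [1.0, 0.0], [-1.0, -1.0]]

aff = cosine_affinity(cur, hist)
locations = refine_locations(aff, k=1, radius=1)
fused = aggregate(cur, hist, aff)
```

In this toy input, only the first location has matching history features, so it receives the highest affinity and seeds the sampling set, while the dissimilar history features at the other two locations leave the current-frame features unchanged.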

Paper Structure

This paper contains 24 sections, 12 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Our hierarchical temporal context learning method versus the previous straightforward temporal method (VoxFormer-T [li2023voxformer]) in semantic scene completion.
  • Figure 2: (a) Overview pipeline of the proposed method, which measures contextual pattern affinity across temporal frames and dynamically samples relevant context. Our method shows promising performance in comprehending and completing semantic scenes even outside the camera's field of view, as indicated by the car highlighted with the yellow box. (b) Comparison with state-of-the-art camera-based semantic scene completion methods [li2023voxformer, zhang2023occformer, huang2023tri, cao2022monoscene, wei2023surroundocc] on the SemanticKITTI test set.
  • Figure 3: Overall framework of our proposed method. Given temporal RGB images, the Aligned Temporal Volume is constructed with explicit epipolar homography warping, while the Voxel Feature Volume is built by extending the LSS paradigm. Afterward, the Reliable Temporal Aggregation is introduced to dynamically aggregate reliable relevant temporal content for fine-grained semantic scene prediction.
  • Figure 4: Visualization of the heat maps from our proposed Cross-frame Pattern Affinity (CPA) and the original cosine similarity.
  • Figure 5: Qualitative results on the SemanticKITTI validation set. Our proposed HTCL captures more complete and accurate scenery layouts compared with VoxFormer. Meanwhile, HTCL hallucinates more proper scenery beyond the camera field of view.
  • ...and 7 more figures