Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Bohan Li, Jiajun Deng, Yasheng Sun, Xiaofeng Wang, Xin Jin, Wenjun Zeng

TL;DR

The paper tackles camera-based semantic occupancy prediction by addressing misalignment in context fusion. It introduces Hi-SOP, a hierarchical framework that first disentangles geometric and temporal contexts using modules like Geometric Confidence-aware Lifting, Cross-frame Pattern Affinity, and Affinity-based Dynamic Refinement, then globally composes them via Depth-Hypothesis-Based Transformation. The approach achieves state-of-the-art results on SemanticKITTI, NuScenes-Occupancy, and NuScenes LiDAR segmentation benchmarks, often surpassing LiDAR-based methods on semantic scene completion (SSC) tasks. This work enhances 3D scene understanding for autonomous driving using camera inputs by delivering more reliable, dense semantic occupancy predictions and more stable learning dynamics.

Abstract

Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist occupancy representation learning, alleviating issues such as occlusion and ambiguity. However, these solutions often suffer from misalignment: features at the same position in different frames may carry different semantic meanings during aggregation, which leads to unreliable contextual fusion and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for more accurate SOP (Hi-SOP). Hi-SOP first disentangles the geometric and temporal contexts and aligns each separately; the two branches are then composed to enhance the reliability of SOP. This parsing of the visual input into a local-global alignment hierarchy includes: (I) separate alignment of the disentangled geometric and temporal contexts, which use depth confidence and camera pose, respectively, as priors for matching relevant features; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantic consistency. Our method outperforms state-of-the-art methods for semantic scene completion on the SemanticKITTI and NuScenes-Occupancy datasets and for LiDAR semantic segmentation on the NuScenes dataset. The project website is available at https://arlo0o.github.io/hisop.github.io/.
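
The misalignment described in the abstract can be illustrated with a toy example. The sketch below is not the Hi-SOP temporal branch; it only assumes a 2D ego pose `(x, y, yaw)`, a hypothetical 0.5 m voxel size, and a static car, and shows that the same object lands in different ego-frame grid cells in two frames unless the previous frame is first warped with the camera (ego) pose. All function names are illustrative.

```python
import math

def ego_to_world(point_xy, pose):
    """Map a 2D point from the ego frame to the world frame.
    pose = (x, y, yaw): ego position and heading in the world frame."""
    px, py = point_xy
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * px - s * py, y + s * px + c * py)

def world_to_ego(point_xy, pose):
    """Inverse of ego_to_world: express a world point in the ego frame."""
    wx, wy = point_xy
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    dx, dy = wx - x, wy - y
    return (c * dx + s * dy, -s * dx + c * dy)

def voxel_index(point_xy, voxel_size=0.5):
    """Grid cell that contains the point (voxel_size is an assumed value)."""
    return (math.floor(point_xy[0] / voxel_size),
            math.floor(point_xy[1] / voxel_size))

def warp_prev_to_curr(point_prev, pose_prev, pose_curr):
    """Pose-based alignment: previous ego frame -> world -> current ego frame."""
    return world_to_ego(ego_to_world(point_prev, pose_prev), pose_curr)

# The ego vehicle drives 2 m forward between frames; the car is static.
pose_prev = (0.0, 0.0, 0.0)
pose_curr = (2.0, 0.0, 0.0)
car_world = (10.2, 1.3)

car_prev = world_to_ego(car_world, pose_prev)  # car as seen in the previous frame
car_curr = world_to_ego(car_world, pose_curr)  # car as seen in the current frame

# Naive stacking compares the same grid index across frames.
naive_same_cell = voxel_index(car_prev) == voxel_index(car_curr)

# Warping the previous observation with the relative pose restores agreement.
car_prev_aligned = warp_prev_to_curr(car_prev, pose_prev, pose_curr)
aligned_same_cell = voxel_index(car_prev_aligned) == voxel_index(car_curr)

print(naive_same_cell, aligned_same_cell)
```

A rigid warp of this kind only accounts for ego motion. Moving objects and scale changes still break the correspondence, which is the gap the abstract's feature-matching priors are meant to address.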

Paper Structure

This paper contains 25 sections, 17 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Comparison of our hierarchical context alignment method with prior geometric modeling (e.g., OccFormer, Zhang et al., 2023) and temporal modeling approaches (e.g., VoxFormer-T, Li et al., 2023) for semantic occupancy prediction. Previous methods handle either geometric lifting or temporal stacking separately, often fusing contexts trivially in a black-box manner, which leads to misalignment and unreliable fusion. In contrast, our approach integrates both geometric and temporal representations in a hierarchically aligned manner, enabling robust contextual composition.
  • Figure 2: (a) Visualization examples of the misalignment issue. The same car appears at different locations across multiple frames. Directly fusing contextual information from these frames neglects positional shifts and scale variations, which could lead to ambiguous contextual aggregation and unstable representation learning for semantic occupancy prediction. (b) The effect of the hierarchical context alignment on the SemanticKITTI validation set. We remove both the temporal alignment and the geometric alignment to implement the setting of 'w/o align'. The proposed hierarchical context alignment strategy captures more reliable and comprehensive semantic scenes, and leads to more stable representation modeling in the learning process.
  • Figure 3: Overall framework of our proposed hierarchical context alignment scheme, which is composed of the Geometric Alignment, the Temporal Alignment, and the Global Composition. The Geometric Confidence-aware Lifting (GCL) module is introduced to facilitate explicit geometric alignment with depth distribution confidence. The Cross-frame Pattern Affinity (CPA) measurement and the Affinity-based Dynamic Refinement (ADR) module are presented to quantify the regional contextual relevance and to dynamically refine the feature sampling locations based on that relevance, respectively. Afterward, the Global Composition with the Depth-Hypothesis-Based Transformation (DHBT) module is introduced to aggregate the disentangled relevant content for reliable fine-grained SOP.
  • Figure 4: The structure of the proposed Geometric Confidence-aware Lifting (GCL) module, which explicitly models the geometric information with depth distribution confidence.
  • Figure 5: The structure of the proposed Cross-frame Pattern Affinity (CPA) measurement, which is proposed to quantify the regional contextual correspondence within the temporal feature volume.
  • ...and 6 more figures
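
The captions for Figures 3 and 4 describe lifting image features into 3D "with depth distribution confidence" but do not give the formula. The following is a minimal sketch of one way such a lifting could look, under stated assumptions: a per-pixel softmax over discrete depth bins, and a confidence score defined here as one minus the normalized entropy of that distribution. Both the confidence definition and the names `depth_confidence` and `lift_pixel` are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of depth-bin logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def depth_confidence(probs):
    """Assumed confidence: 1 - normalized entropy of the depth distribution.
    Close to 1 for a sharply peaked distribution, 0 for a uniform one."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))

def lift_pixel(feature, depth_logits):
    """Spread one pixel feature along its camera ray.
    Each depth bin receives the feature scaled by the bin probability and by
    the pixel's confidence, so ambiguous pixels contribute less to the volume."""
    probs = softmax(depth_logits)
    conf = depth_confidence(probs)
    lifted = [[conf * p * f for f in feature] for p in probs]
    return lifted, conf

feature = [1.0, 2.0]  # toy 2-channel image feature

# A pixel with a clear depth estimate versus a fully ambiguous one.
sharp_volume, sharp_conf = lift_pixel(feature, [10.0, 0.0, 0.0, 0.0])
flat_volume, flat_conf = lift_pixel(feature, [0.0, 0.0, 0.0, 0.0])

print(round(sharp_conf, 3), round(flat_conf, 3))
```

In this sketch the ambiguous pixel is suppressed entirely, which is a deliberately blunt design choice; a learned confidence head or a softer weighting would be equally plausible readings of the caption.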