Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance
Meng Wang, Fan Wu, Ruihui Li, Yunchuan Qin, Zhuo Tang, Kenli Li
TL;DR
FlowScene introduces optical flow guidance to temporal 3D semantic scene completion (SSC), targeting two weaknesses of prior SSC methods: missing motion context and temporal inconsistency. It comprises a Flow-Guided Temporal Aggregation module, which aligns 2D features from historical frames to the current frame and fuses them, and an Occlusion-Guided Voxel Refinement module, which injects occlusion masks and the aggregated features into 3D voxel space to refine the voxel representations. The approach achieves state-of-the-art results on SemanticKITTI and SSCBench-KITTI-360, with particular improvement in dynamic object completion while maintaining geometric accuracy, and it uses only two historical frames for efficiency. The work shows that motion cues from optical flow, combined with occlusion-aware fusion, improve 3D scene completion for autonomous driving; potential extensions include multi-camera setups and real-time deployment.
Abstract
3D Semantic Scene Completion (SSC) provides comprehensive scene geometry and semantics for autonomous driving perception, which is crucial for accurate and reliable decision-making. However, existing SSC methods either capture only sparse information from the current frame or naively stack multi-frame temporal features, and therefore fail to acquire effective scene context. These approaches ignore critical motion dynamics and struggle to achieve temporal consistency. To address these challenges, we propose FlowScene, a novel temporal SSC method that learns temporal 3D semantic scene completion via optical flow guidance. By leveraging optical flow, FlowScene integrates motion, different viewpoints, occlusions, and other contextual cues, thereby significantly improving the accuracy of 3D scene completion. Specifically, our framework introduces two key components: (1) a Flow-Guided Temporal Aggregation module that aligns and aggregates temporal features using optical flow, capturing motion-aware context and deformable structures; and (2) an Occlusion-Guided Voxel Refinement module that injects occlusion masks and temporally aggregated features into 3D voxel space, adaptively refining voxel representations for explicit geometric modeling. Experimental results demonstrate that FlowScene achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.
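The abstract does not spell out how the flow-guided alignment is computed, so the following is only a minimal NumPy sketch of the general idea: backward-warp a historical feature map into the current frame with a dense flow field, then fuse it with the current features. The function names `warp_features` and `aggregate`, the flow convention (per-pixel offset from the current frame to the previous frame), and the cosine-similarity weighting are all assumptions made for illustration, not the paper's actual design.

```python
import numpy as np

def warp_features(feat_prev, flow):
    """Backward-warp feat_prev (C, H, W) into the current frame.

    Assumed convention: flow[0], flow[1] give, for every current-frame
    pixel, the (dx, dy) offset to its location in the previous frame.
    Returns the warped features and a mask of pixels that sampled
    inside the previous image.
    """
    _, H, W = feat_prev.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x, src_y = xs + flow[0], ys + flow[1]
    valid = (src_x >= 0) & (src_x <= W - 1) & (src_y >= 0) & (src_y <= H - 1)
    src_x, src_y = np.clip(src_x, 0, W - 1), np.clip(src_y, 0, H - 1)
    # Bilinear interpolation between the four neighbouring pixels.
    x0, y0 = np.floor(src_x).astype(int), np.floor(src_y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = src_x - x0, src_y - y0
    top = feat_prev[:, y0, x0] * (1 - wx) + feat_prev[:, y0, x1] * wx
    bottom = feat_prev[:, y1, x0] * (1 - wx) + feat_prev[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy) * valid, valid

def aggregate(feat_cur, warped_list, valid_list):
    """Fuse current and warped features with per-pixel softmax weights.

    Weights come from cosine similarity to the current frame (an
    illustrative choice); invalid pixels receive zero weight.
    """
    feats = [feat_cur] + list(warped_list)
    masks = [np.ones(feat_cur.shape[1:], dtype=bool)] + list(valid_list)

    def cosine(a, b):
        den = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-8
        return (a * b).sum(0) / den

    logits = np.stack([np.where(m, cosine(f, feat_cur), -np.inf)
                       for f, m in zip(feats, masks)])
    logits = logits - logits.max(0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(0, keepdims=True)
    return (weights[:, None] * np.stack(feats)).sum(0)

# Toy usage: the previous frame is the current frame shifted one pixel
# to the right, so a constant flow of dx = +1 re-aligns it.
rng = np.random.default_rng(0)
feat_cur = rng.normal(size=(4, 6, 8))
feat_prev = np.roll(feat_cur, 1, axis=2)
flow = np.zeros((2, 6, 8))
flow[0] = 1.0
warped, valid = warp_features(feat_prev, flow)
fused = aggregate(feat_cur, [warped], [valid])
```

In the toy case the warped map matches the current features everywhere except the last column, where the flow points outside the previous image; the validity mask excludes that column from the fusion. A learned model would typically predict the flow with a pretrained optical flow network and learn the fusion weights rather than fixing them.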
