Enhancing 3D Semantic Scene Completion with a Refinement Module
Dunxing Zhang, Jiachen Lu, Han Yang, Lei Bao, Bo Song
TL;DR
This work introduces ESSC-RM, a unified refinement framework that improves 3D semantic scene completion (SSC) by refining the coarse voxel predictions of an existing model with a model-agnostic 3D U-Net backbone. It integrates a Progressive Neighborhood Attention Module (PNAM) for multiscale voxel reasoning and a Visual–Language Guidance Module (VLGM) that injects text-derived priors through Text–Voxel Fusion blocks. On SemanticKITTI, the refinement raises mean IoU for both a strong backbone (CGFormer, 16.87% to 17.27%) and a weaker baseline (MonoScene, 11.08% to 11.51%); the paper also analyzes efficiency and qualitative improvements. By combining geometric refinement with cross-modal priors, ESSC-RM offers a plug-and-play way to improve SSC across heterogeneous architectures, and the results point to cross-modal priors and efficient refinement as directions for large-scale 3D scene understanding.
Abstract
We propose ESSC-RM, a plug-and-play Enhancing framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing SSC models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) under multiscale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU (mIoU) increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively, gains of 0.40 and 0.43 percentage points. These results indicate that ESSC-RM serves as a general refinement framework applicable to SSC models of differing strength.
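Since the reported gains are stated in mean IoU, a minimal sketch of that metric over flattened voxel labels may help make the numbers concrete. This is the standard per-class intersection-over-union averaged across classes, not code from the paper. The function name `mean_iou`, the ignore label 255, and the choice to leave the empty class (label 0) out of the average are assumptions made for illustration.

```python
from typing import Sequence

IGNORE_LABEL = 255  # assumption: marks voxels excluded from evaluation
EMPTY_LABEL = 0     # assumption: the "empty" class is left out of the average


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_label: int = IGNORE_LABEL,
    empty_label: int = EMPTY_LABEL,
) -> float:
    """Mean per-class IoU over flattened voxel labels.

    Predictions are assumed to lie in [0, num_classes). Classes that appear
    in neither the prediction nor the target are skipped when averaging.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # voxel has no valid ground truth
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # a mismatch enlarges the union of both classes involved
            union[p] += 1
            union[t] += 1
    ious = [
        intersection[c] / union[c]
        for c in range(num_classes)
        if c != empty_label and union[c] > 0
    ]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels, three classes (0 = empty), one ignored voxel.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 255, 1]
score = mean_iou(pred, target, num_classes=3)
```

In the toy example, class 1 has IoU 1/3 and class 2 has IoU 1/2, so the mean is 5/12, roughly 41.7%. Under this definition, a refinement step raises mIoU only if, averaged over classes, it corrects more voxels than it disturbs.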
