LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba
Yubo Cui, Zhiheng Li, Jiaqiang Wang, Zheng Fang
TL;DR
This work tackles vision-based 3D semantic occupancy prediction, addressing two problems: images carry limited geometric cues, and global modeling is computationally expensive. It introduces LOMA, a vision-language framework that couples a VL-aware Scene Generator (VSG) with a Multi-scale Tri-plane Fusion Mamba to fuse 3D vision and language features globally and efficiently. Key contributions include voxel-wise 3D language features produced by VSG and a tri-plane, state-space-based fusion module (TFM) with a multi-scale extension (MS-TFM), validated on SemanticKITTI and SSCBench-KITTI360. Results show state-of-the-art performance in both geometric completion and semantic segmentation, indicating that language priors are of practical value for 3D perception.
Abstract
Vision-based 3D occupancy prediction has become a popular research task due to its versatility and affordability. Conventional methods usually project image-based vision features into 3D space and learn geometric information through an attention mechanism in order to predict 3D semantic occupancy. However, these methods face two main challenges: 1) Limited geometric information. Because images themselves lack geometric information, it is difficult to predict 3D spatial information directly, especially in large-scale outdoor scenes. 2) Locally restricted interaction. Because attention has quadratic complexity, these methods often fall back on modified local attention to fuse features, which restricts the fusion to local neighborhoods. To address these problems, we propose a language-assisted 3D semantic occupancy prediction network named LOMA. In the proposed vision-language framework, we first introduce a VL-aware Scene Generator (VSG) module to generate the 3D language feature of the scene. By leveraging a vision-language model, this module provides implicit geometric knowledge and explicit semantic information from language. Furthermore, we present a Tri-plane Fusion Mamba (TFM) block to efficiently fuse the 3D language feature and the 3D vision feature. The proposed module fuses the two features with global modeling while avoiding excessive computational cost. Experiments on the SemanticKITTI and SSCBench-KITTI360 datasets show that our algorithm achieves new state-of-the-art performance in both geometric and semantic completion tasks. Our code will be released soon.
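To make the cost argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It compares the number of tokens in a dense voxel grid with the number of tokens in its three axis-aligned planes, and the number of pairwise scores full self-attention would compute in each case. The grid size (256 x 256 x 32) and the helper names `triplane_token_counts` and `pairwise_interactions` are assumptions made for illustration; the resolution at which TFM actually operates is not stated in the abstract.

```python
# Hypothetical illustration only: token counts for a dense voxel grid versus
# its three axis-aligned planes. Nothing here is taken from the LOMA code.

def triplane_token_counts(x, y, z):
    """Return (voxel_tokens, plane_tokens) for an x * y * z grid."""
    voxel_tokens = x * y * z
    # One token per cell on each of the XY, XZ, and YZ planes.
    plane_tokens = x * y + x * z + y * z
    return voxel_tokens, plane_tokens


def pairwise_interactions(n):
    """Number of token pairs a full self-attention layer scores (n squared)."""
    return n * n


# Assumed grid resolution, chosen for illustration.
X, Y, Z = 256, 256, 32
voxels, planes = triplane_token_counts(X, Y, Z)

print("voxel tokens:", voxels)
print("tri-plane tokens:", planes)
print("attention pairs on voxels:", pairwise_interactions(voxels))
print("attention pairs on planes:", pairwise_interactions(planes))
```

Under these assumed dimensions, the dense grid has 2,097,152 tokens while the three planes together have 81,920. A sequence model whose cost grows linearly with sequence length, as state-space models such as Mamba do, scales with the token count itself rather than with its square, which is the motivation the abstract gives for replacing local attention with a global but cheaper fusion.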
