LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba

Yubo Cui, Zhiheng Li, Jiaqiang Wang, Zheng Fang

TL;DR

This work tackles vision-based 3D semantic occupancy prediction by addressing two problems: limited geometric cues in images and the high computational cost of global modeling. It introduces LOMA, a vision-language framework that couples a VL-aware Scene Generator (VSG) with a Multi-scale Tri-plane Fusion Mamba (MS-TFM) to fuse 3D vision and language features globally and efficiently. Key contributions include voxel-wise 3D language features generated by VSG and a tri-plane, state-space-based fusion module (TFM) with a multi-scale extension (MS-TFM), validated on SemanticKITTI and SSCBench-KITTI360. Results show state-of-the-art performance in both geometric and semantic completion, supporting the practical value of language priors for 3D perception.

Abstract

Vision-based 3D occupancy prediction has become a popular research task due to its versatility and affordability. Conventional methods usually project image-based vision features into 3D space and learn geometric information through the attention mechanism to predict 3D semantic occupancy. However, these works usually face two main challenges: 1) Limited geometric information. Because an image itself lacks geometric information, it is challenging to directly predict 3D spatial information, especially in large-scale outdoor scenes. 2) Locally restricted interaction. Due to the quadratic complexity of the attention mechanism, they often use modified local attention to fuse features, which restricts the fusion. To address these problems, we propose a language-assisted 3D semantic occupancy prediction network, named LOMA. In the proposed vision-language framework, we first introduce a VL-aware Scene Generator (VSG) module to generate the 3D language feature of the scene. By leveraging a vision-language model, this module provides implicit geometric knowledge and explicit semantic information from language. Furthermore, we present a Tri-plane Fusion Mamba (TFM) block to efficiently fuse the 3D language feature and the 3D vision feature. The proposed module not only fuses the two features with global modeling but also avoids excessive computational cost. Experiments on the SemanticKITTI and SSCBench-KITTI360 datasets show that our algorithm achieves new state-of-the-art performance in both geometric and semantic completion tasks. Our code will be released soon.
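
As a rough illustration of the complexity argument in the abstract, the sketch below counts tokens for a dense voxel grid and for its three axis-aligned planes. The 256 x 256 x 32 grid is an assumed example resolution, not a figure stated here, and `token_counts` is a hypothetical helper. The counts ignore channel width and constant factors, so they indicate scaling only, not measured cost.

```python
def token_counts(x: int, y: int, z: int) -> dict:
    """Count tokens for a dense voxel grid and for its three axis-aligned planes."""
    voxels = x * y * z
    plane_tokens = x * y + x * z + y * z  # XY, XZ and YZ planes
    return {
        "voxels": voxels,
        "plane_tokens": plane_tokens,
        # Global self-attention compares every token with every other token.
        "dense_attention_pairs": voxels ** 2,
        # A sequential state-space scan visits each plane token once.
        "plane_scan_steps": plane_tokens,
    }


# Assumed example grid size (not taken from the text above).
counts = token_counts(256, 256, 32)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under this assumed grid, full attention over all voxels would involve on the order of 10^12 token pairs, while scanning the three planes touches fewer than 10^5 tokens. This is the gap that motivates replacing local attention with a plane-wise, linear-time fusion.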

Paper Structure

This paper contains 25 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) Previous vision-only framework. (b) Our proposed vision-language framework. Compared to (a), our method introduces an explicit prior from language to enhance 3D scene understanding. $\dashrightarrow$ and $\dashrightarrow$ represent 2D-to-3D and 3D-to-3D feature propagation, respectively.
  • Figure 2: Architecture of the proposed LOMA. The model takes the image and the category text as inputs. The image encoder extracts multi-scale image features from the image and performs 2D-to-3D feature propagation through deformable attention. Meanwhile, the VL-aware scene generator uses a VLM to generate the scene-level 3D language features. We further propose the Multi-scale Triplane Fusion Mamba (MS-TFM) layer to fuse the 3D scene-level vision and language features. Finally, the fused vision feature is used to predict the semantic occupancy. For clarity, the pre-trained depth network is omitted.
  • Figure 3: (a) Architecture of the proposed TFM module. We concatenate the features of the two modalities along the feature channel and use three Linear layers to project the result to three 2D plane features, respectively. Then, a shared SSM block is used to perform global interaction. Subsequently, we use three Linear layers to project the 2D features back to 3D features and sum them up. Finally, the updated vision and language features are split along the feature channel. (b) Details of the SSM block.
  • Figure 4: Illustration of the proposed MS-TFM layer.
  • Figure 5: Qualitative visualizations on the SemanticKITTI validation set. Our proposed LOMA generates more refined predictions for objects and also preserves the organized layout of structures.
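
The data flow described in the Figure 3 caption (concatenate, project to three planes, shared SSM, project back, sum, split) can be sketched in NumPy as follows. This is a minimal sketch under several assumptions, not the authors' implementation. Each "Linear layer" is assumed to mix along the collapsed axis only. The shared SSM block is replaced by a toy diagonal linear recurrence rather than Mamba's selective scan. Planes are flattened in row-major order. The names `tfm_sketch`, `ssm_scan`, and every key in `params` are hypothetical.

```python
import numpy as np


def ssm_scan(seq, a, b, c):
    """Toy diagonal linear recurrence over a (L, D) sequence.

    Stand-in for the shared SSM block (assumption: not a selective Mamba scan).
    """
    h = np.zeros(seq.shape[1])
    out = np.empty_like(seq)
    for t in range(seq.shape[0]):
        h = a * h + b * seq[t]   # state update
        out[t] = c * h + seq[t]  # readout with a skip connection
    return out


def tfm_sketch(vis, lang, params):
    """Fuse two (X, Y, Z, C) feature volumes through three 2D planes."""
    C = vis.shape[-1]
    f = np.concatenate([vis, lang], axis=-1)  # (X, Y, Z, 2C)

    # Project the 3D volume onto three planes by a linear map along one axis.
    p_xy = np.einsum("xyzc,z->xyc", f, params["down_z"])
    p_xz = np.einsum("xyzc,y->xzc", f, params["down_y"])
    p_yz = np.einsum("xyzc,x->yzc", f, params["down_x"])

    def scan(plane):
        # Flatten the plane to a sequence and run the shared recurrence.
        flat = plane.reshape(-1, plane.shape[-1])
        return ssm_scan(flat, params["a"], params["b"], params["c"]).reshape(plane.shape)

    p_xy, p_xz, p_yz = scan(p_xy), scan(p_xz), scan(p_yz)

    # Lift each plane back to 3D along its collapsed axis and sum.
    out = (
        np.einsum("xyc,z->xyzc", p_xy, params["up_z"])
        + np.einsum("xzc,y->xyzc", p_xz, params["up_y"])
        + np.einsum("yzc,x->xyzc", p_yz, params["up_x"])
    )
    # Split the fused volume back into vision and language halves.
    return out[..., :C], out[..., C:]


# Usage with small, arbitrary sizes and fixed illustrative weights.
X, Y, Z, C = 8, 8, 4, 3
params = {
    "down_x": np.full(X, 1.0 / X), "down_y": np.full(Y, 1.0 / Y), "down_z": np.full(Z, 1.0 / Z),
    "up_x": np.ones(X), "up_y": np.ones(Y), "up_z": np.ones(Z),
    "a": np.full(2 * C, 0.9), "b": np.ones(2 * C), "c": np.ones(2 * C),
}
rng = np.random.default_rng(0)
vis = rng.standard_normal((X, Y, Z, C))
lang = rng.standard_normal((X, Y, Z, C))
v, l = tfm_sketch(vis, lang, params)
print(v.shape, l.shape)
```

Because the recurrence carries state across the whole flattened plane, a change at one corner of the volume can influence the opposite corner, which is the sense in which the fusion is global. The toy scan here is causal and runs in one direction only; a real block may scan in several directions.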