Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters

Julia Hindel, Rohit Mohan, Jelena Bratulic, Daniele Cattaneo, Thomas Brox, Abhinav Valada

TL;DR

BALViT addresses label-efficient LiDAR semantic segmentation by converting a frozen vision transformer into a robust 3D encoder through a 2D-3D adapter that fuses range-view and polar BEV representations. The architecture employs separate RV and BEV encoders, bidirectional cross-attention injectors, and independent decoders, enabling strong performance in low-data regimes while remaining parameter-efficient. Empirical results on SemanticKITTI and nuScenes show BALViT surpassing state-of-the-art supervised and self-supervised baselines at 0.1% and 1% labeling, with competitive performance at higher data volumes. The work highlights the potential of vision foundation models for 3D perception and offers a practical, scalable path for integrating 2D priors into LiDAR-based segmentation.

Abstract

LiDAR semantic segmentation models are typically trained from random initialization, as universal pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird's-eye-view LiDAR encoding mechanisms, which we combine through a novel 2D-3D adapter. While the range-view features are processed through a frozen image backbone, our bird's-eye-view branch enhances them through multiple cross-attention interactions. Thereby, we continuously improve the vision network with domain-dependent knowledge, resulting in a strong label-efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods in small-data regimes. We make the code and models publicly available at: http://balvit.cs.uni-freiburg.de.
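The two encodings named in the abstract can be illustrated with a minimal NumPy sketch. It uses the standard spherical projection for the range view and a polar grid for the bird's-eye view. This is not the authors' implementation: the function names, the image and grid sizes, the vertical field of view (3° up, 25° down), and the 50 m radius are all assumptions chosen for illustration.

```python
import numpy as np


def range_view_indices(points, height=64, width=1024,
                       fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map each 3D point (N, 3) to a (row, col) pixel of a range image.

    All default values are illustrative assumptions, not values from the paper.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.maximum(np.linalg.norm(points, axis=1), 1e-6)

    yaw = np.arctan2(y, x)                            # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / depth, -1.0, 1.0))  # elevation angle

    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down

    u = 0.5 * (1.0 - yaw / np.pi) * width             # column from azimuth
    v = (1.0 - (pitch - fov_down) / fov) * height     # row, top row = fov_up

    col = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    row = np.clip(np.floor(v), 0, height - 1).astype(np.int64)
    return row, col, depth


def polar_bev_indices(points, num_rho=480, num_phi=360, max_rho=50.0):
    """Map each 3D point to a (radius bin, angle bin) cell of a polar BEV grid.

    Grid resolution and maximum radius are illustrative assumptions.
    """
    x, y = points[:, 0], points[:, 1]
    rho = np.hypot(x, y)
    phi = np.arctan2(y, x)

    rho_idx = np.floor(rho / max_rho * num_rho)
    phi_idx = np.floor((phi + np.pi) / (2.0 * np.pi) * num_phi)
    rho_idx = np.clip(rho_idx, 0, num_rho - 1).astype(np.int64)
    phi_idx = np.clip(phi_idx, 0, num_phi - 1).astype(np.int64)
    return rho_idx, phi_idx


# Usage on a small synthetic point cloud.
rng = np.random.default_rng(0)
cloud = rng.uniform(-40.0, 40.0, size=(1000, 3)) * np.array([1.0, 1.0, 0.05])
rows, cols, depths = range_view_indices(cloud)
rho_bins, phi_bins = polar_bev_indices(cloud)
```

Both functions return per-point indices, so features computed on either 2D grid can be gathered back to the original points. That is the property a range-view branch and a BEV branch need in order to produce pointwise labels.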

Paper Structure

This paper contains 18 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Different learning paradigms for LiDAR semantic segmentation. Learned modules are colored in red and frozen components in blue. (a) Traditional methods leverage randomly initialized LiDAR networks. (b) Vision or language foundation models are employed to distill knowledge into tailored LiDAR architectures. (c) Transferring a pre-trained vision model into the LiDAR domain using a range-view projection and a 2D-3D adapter (ours).
  • Figure 2: Our network BALViT encodes a point cloud in orthogonal range-view (RV) and bird's-eye-view (BEV) branches. Our spatial prior module converts the BEV branch into multi-scale features, which interact with the RV branch during its traversal of the frozen ViT backbone. Finally, our two decoders independently obtain pointwise class labels from the respective feature maps.
  • Figure 3: Qualitative results of BALViT on LiDAR semantic segmentation on nuScenes and SemanticKITTI.
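
The Figure 2 caption describes BEV features interacting with the RV branch through cross-attention. A minimal single-head sketch of one such interaction, in which RV tokens query BEV tokens and the result is added back residually, might look as follows. The token counts, feature dimension, random weights, and function names are hypothetical; the actual adapter in the paper may differ in heads, normalization, gating, and direction of information flow.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (Nq, D) tokens that receive information (here: RV tokens).
    context: (Nc, D) tokens that provide information (here: BEV tokens).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (Nq, Nc) similarity scores
    weights = softmax(scores, axis=-1)        # each query row sums to 1
    return weights @ v, weights


# Hypothetical sizes and randomly initialized projections for illustration.
rng = np.random.default_rng(0)
dim = 16
rv_tokens = rng.normal(size=(8, dim))         # tokens from the RV branch
bev_tokens = rng.normal(size=(12, dim))       # tokens from the BEV branch
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim))
                 for _ in range(3))

update, attn = cross_attention(rv_tokens, bev_tokens, w_q, w_k, w_v)
rv_tokens_out = rv_tokens + update            # residual injection into RV
```

Swapping the roles of `rv_tokens` and `bev_tokens` gives the opposite direction, in which BEV tokens gather information from the RV branch.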