Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters

Julia Hindel; Rohit Mohan; Jelena Bratulic; Daniele Cattaneo; Thomas Brox; Abhinav Valada

TL;DR

BALViT targets label-efficient LiDAR semantic segmentation by turning a frozen vision transformer into a LiDAR encoder through a 2D-3D adapter that fuses range-view (RV) and polar bird's-eye-view (BEV) representations. The architecture uses separate RV and BEV encoders, bidirectional cross-attention injectors, and independent decoders, which keeps it parameter-efficient while delivering strong performance in low-data regimes. On SemanticKITTI and nuScenes, BALViT surpasses state-of-the-art supervised and self-supervised baselines when trained with 0.1% and 1% of the labels, and remains competitive at higher data volumes. The work highlights the potential of vision foundation models for 3D perception and offers a practical way to integrate 2D priors into LiDAR-based segmentation.
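To make the idea of a bidirectional cross-attention exchange between RV and BEV tokens concrete, here is a minimal pure-Python sketch. It is not the BALViT injector: the function names `cross_attend` and `bidirectional_inject` are hypothetical, the learned query/key/value projections are replaced by the identity, and running both directions in parallel with a residual connection is an assumption made for illustration.

```python
import math


def cross_attend(queries, context):
    """Scaled dot-product cross-attention with a residual connection.

    Assumption: identity projections stand in for the learned
    query/key/value matrices a real adapter would use.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query token to every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        # Numerically stable softmax over the context tokens.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the context tokens (used as values).
        attended = [
            sum(w * v[i] for w, v in zip(weights, context)) / total
            for i in range(d)
        ]
        # Residual: the query token is enriched, not replaced.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def bidirectional_inject(rv_tokens, bev_tokens):
    """Hypothetical exchange: RV attends to BEV and BEV attends to RV."""
    new_rv = cross_attend(rv_tokens, bev_tokens)
    new_bev = cross_attend(bev_tokens, rv_tokens)
    return new_rv, new_bev


rv = [[1.0, 0.0]]   # one toy range-view token
bev = [[0.0, 2.0]]  # one toy BEV token
new_rv, new_bev = bidirectional_inject(rv, bev)
```

With a single context token the softmax weight is 1, so each token simply receives the other branch's token as a residual update. In a trained model the projections and any gating would be learned, while the vision transformer that processes the RV tokens stays frozen.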

Abstract

LiDAR semantic segmentation models are typically trained from random initialization, as universal pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird's-eye-view LiDAR encoding mechanisms, which we combine through a novel 2D-3D adapter. While the range-view features are processed through a frozen image backbone, our bird's-eye-view branch enhances them through multiple cross-attention interactions. Thereby, we continuously improve the vision network with domain-dependent knowledge, resulting in a strong label-efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods in small-data regimes. We make the code and models publicly available at: http://balvit.cs.uni-freiburg.de.
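The two LiDAR encodings named above can be illustrated with a small sketch that maps a single 3D point to a range-view pixel (spherical projection) and to a polar BEV cell. This is a generic illustration and not the BALViT preprocessing: the function names, image size, vertical field of view, and grid resolution below are all assumed values chosen for the example.

```python
import math


def range_view_pixel(x, y, z, width=2048, height=64,
                     fov_up_deg=3.0, fov_down_deg=-25.0):
    """Spherical projection of a point to (row, col, range).

    All default parameters are assumptions for illustration only.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point at the sensor origin has no direction")
    yaw = math.atan2(y, x)    # horizontal angle in [-pi, pi]
    pitch = math.asin(z / r)  # vertical angle
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    u = 0.5 * (1.0 - yaw / math.pi)              # 0..1 across the image width
    v = (fov_up - pitch) / (fov_up - fov_down)   # 0 at the top row
    # Clamp so points on or outside the field of view stay inside the image.
    col = min(width - 1, max(0, int(u * width)))
    row = min(height - 1, max(0, int(v * height)))
    return row, col, r


def polar_bev_cell(x, y, n_rho=480, n_phi=360, max_rho=50.0):
    """Polar bird's-eye-view binning of a point to (radial bin, angular bin).

    Grid size and maximum radius are assumed values.
    """
    rho = math.hypot(x, y)
    phi = math.atan2(y, x)
    i = min(n_rho - 1, int(rho * n_rho / max_rho))
    j = min(n_phi - 1, int((phi + math.pi) / (2.0 * math.pi) * n_phi))
    return i, j


# A point 10 m straight ahead of the sensor, at sensor height.
rv = range_view_pixel(10.0, 0.0, 0.0)
bev = polar_bev_cell(10.0, 0.0)
```

The same point therefore lands in two different 2D grids: the range view preserves the sensor's scan pattern and resembles an image, while the polar BEV grid preserves the ground-plane layout. That complementarity is the motivation for combining the two views.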
