RangeSAM: On the Potential of Visual Foundation Models for Range-View represented LiDAR segmentation

Paul Julius Kühn, Duc Anh Nguyen, Arjan Kuijper, Holger Graf, Saptarshi Neil Sinha

TL;DR

This work tackles efficient LiDAR point cloud segmentation by projecting unordered scans into range images and leveraging SAM2 as a backbone. It introduces RangeSAM, a range-view segmentation framework with a horizontally biased, SAM2-based encoder (Hiera blocks) and a Receptive Field Block decoder, plus k-NN postprocessing and a composite loss. Training integrates SemanticKITTI and nuScenes with extensive augmentations, yielding a ~63 million-parameter model that achieves a mean IoU of $60.9\%$ on SemanticKITTI validation, demonstrating the viability of Visual Foundation Models for 3D perception. The results suggest VFMs can provide competitive, deployment-friendly backbones for 3D LiDAR segmentation and motivate further work on efficiency and broader foundation-model integration.
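The mean IoU quoted above is normally computed per class from a confusion matrix and then averaged. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `mean_iou`, the `ignore_index` parameter, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection-over-union over classes present in pred or target.

    pred, target: equal-length sequences of integer class ids, one per point.
    ignore_index: hypothetical label id for unannotated points (skipped).
    """
    # conf[t][p] counts points with ground-truth class t predicted as class p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fp = sum(conf[t][c] for t in range(num_classes)) - tp  # predicted c, truly other
        fn = sum(conf[c]) - tp                                  # truly c, predicted other
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both pred and target
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Benchmark toolkits may treat absent classes or ignored labels differently, so a score from this sketch is only comparable to published numbers if those conventions match.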

Abstract

Point cloud segmentation is central to autonomous driving and 3D scene understanding. While voxel- and point-based methods dominate recent research due to their compatibility with deep architectures and ability to capture fine-grained geometry, they often incur high computational cost, irregular memory access, and limited real-time efficiency. In contrast, range-view methods, though relatively underexplored, can leverage mature 2D semantic segmentation techniques for fast and accurate predictions. Motivated by the rapid progress in Visual Foundation Models (VFMs) for captioning, zero-shot recognition, and multimodal tasks, we investigate whether SAM2, the current state-of-the-art VFM for segmentation tasks, can serve as a strong backbone for LiDAR point cloud segmentation in the range view. We present RangeSAM, to our knowledge the first range-view framework that adapts SAM2 to 3D segmentation, coupling efficient 2D feature extraction with standard projection/back-projection to operate on point clouds. To optimize SAM2 for range-view representations, we implement several architectural modifications to the encoder: (1) a novel module that emphasizes horizontal spatial dependencies inherent in LiDAR range images, (2) a customized configuration tailored to the geometric properties of spherical projections, and (3) an adapted mechanism in the encoder backbone specifically designed to capture the unique spatial patterns and discontinuities present in range-view pseudo-images. Our approach achieves competitive performance on SemanticKITTI while benefiting from the speed, scalability, and deployment simplicity of 2D-centric pipelines. This work highlights the viability of VFMs as general-purpose backbones for 3D perception and opens a path toward unified, foundation-model-driven LiDAR segmentation. Our results indicate that range-view segmentation methods built on VFMs are a promising direction.
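The "standard projection/back-projection" mentioned in the abstract is commonly a spherical projection of each point onto an image grid, followed by a lookup that copies per-pixel labels back to the points. The sketch below illustrates that common formulation in plain Python; it is not taken from the paper. The image size (64 x 2048), the vertical field of view (+3° to -25°), and the names `project_to_range_image` and `back_project` are assumptions chosen for illustration.

```python
import math


def project_to_range_image(points, height=64, width=2048,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Spherically project (x, y, z) points onto a height x width range image.

    Returns (range_image, pixel_of_point). Empty pixels hold -1.0, and
    pixel_of_point[i] is the (row, col) of point i, or None for a point at
    the sensor origin. Sensor parameters here are assumed, not from the paper.
    """
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    range_image = [[-1.0] * width for _ in range(height)]
    pixel_of_point = []
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            pixel_of_point.append(None)
            continue
        yaw = math.atan2(y, x)      # azimuth in [-pi, pi]
        pitch = math.asin(z / r)    # elevation angle
        col = int(0.5 * (1.0 - yaw / math.pi) * width)
        row = int((1.0 - (pitch - fov_down) / fov) * height)
        col = min(max(col, 0), width - 1)   # clamp to image bounds
        row = min(max(row, 0), height - 1)
        pixel_of_point.append((row, col))
        # When several points share a pixel, keep the nearest range.
        if range_image[row][col] < 0 or r < range_image[row][col]:
            range_image[row][col] = r
    return range_image, pixel_of_point


def back_project(label_image, pixel_of_point, unlabeled=0):
    """Copy each pixel's predicted label back to every point that landed on it."""
    return [label_image[rc[0]][rc[1]] if rc is not None else unlabeled
            for rc in pixel_of_point]
```

With this naive lookup, all points that fall into the same pixel receive the same label, even if they lie at different depths. Nearest-neighbour post-processing steps, such as the k-NN stage mentioned in the TL;DR, are typically used to reduce this effect.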

Paper Structure

This paper contains 16 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Our approach applies the Segment Anything foundation model (SAM2) to range-view projections of 3D point clouds: (1) render the point cloud as a range image; (2) apply SAM2 to obtain high-quality 2D masks; (3) back-project these masks onto the original point cloud, yielding improved segmentation accuracy.
  • Figure 2: Overview of our SAM2-based point-cloud segmentation model. The stem reshapes range-view point clouds into a tensor suitable for the encoder. The encoder is built from stacked Hiera blocks, each containing a multi-head self-attention module and a feed-forward network. The decoder comprises Receptive Field Blocks (RFB) with LayerNorm and GELU, concatenates multi-scale features, and projects them to $N_{\text{classes}}$, while also adding an auxiliary head (Aux) on the corresponding output.
  • Figure 3: Building blocks of the stem module in RangeSAM
  • Figure 4: Building blocks of the encoder module in RangeSAM
  • Figure 5: Qualitative segmentation examples of increasing difficulty using our model: (a) urban intersection, (b) suburban environment, and (c) highly cluttered scene.