
RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection

Cheng Lu, Mingqian Ji, Shanshan Zhang, Zhihao Li, Jian Yang

Abstract

Long-range 3D object detection remains challenging because LiDAR observations become highly sparse and fragmented in the far field, making reliable context modeling difficult for existing detectors. Recent state space model (SSM)-based methods improve long-range modeling efficiency, but their effectiveness is still limited by generic serialization strategies that fail to preserve meaningful contextual neighborhoods in sparse scenes. To address this issue, we propose RayMamba, a geometry-aware plug-and-play enhancement for voxel-based 3D detectors. RayMamba organizes sparse voxels into sector-wise ordered sequences through a ray-aligned serialization strategy, which preserves directional continuity and occlusion-related context for subsequent Mamba-based modeling. It is compatible with both LiDAR-only and multimodal detectors, while introducing only modest overhead. Extensive experiments on nuScenes and Argoverse 2 demonstrate consistent improvements across strong baselines. In particular, RayMamba achieves gains of up to 2.49 mAP and 1.59 NDS in the challenging 40--50 m range on nuScenes, and further improves VoxelNeXt on Argoverse 2 from 30.3 to 31.2 mAP.

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Due to occlusion and distance-induced sparsity in LiDAR, distant objects are often represented by only a few returns.
  • Figure 2: Comparison of 1D sequence context in long-range sparse scenes. For a given far-field reference voxel (red star), we highlight its context window of $K=360$ adjacent voxels in the serialized sequence. Our ray-aligned ordering (blue) preserves directionally coherent physical structures, whereas the Hilbert ordering (purple) activates spatially scattered, unrelated regions.
  • Figure 3: Overview of RayMamba. Top: RayMamba blocks are inserted into a sparse 3D convolutional backbone. Bottom: Structure of a RayMamba block. RayMamba consists of two components: Ray-Aligned Serialization, which converts sparse voxel features into sector-wise ordered sequences using an offline-generated dense sector template, and SectorMamba3D, which performs sector-wise sequence modeling before the enhanced features are restored to sparse 3D space through Sequence-to-Spatial and sparse deconvolution.
  • Figure 4: Ray-aligned serialization strategy. (a) Azimuth sector partitioning: The BEV space is divided into independent angular sectors to separate directionally distinct regions. (b) Sector-wise ordering: Voxels in each sector are serialized by first traversing height layers from top to bottom, introducing a vertical layering prior, and then applying angular ordering within each layer to preserve directional continuity. For a fixed voxel grid, the resulting sector assignments and ordering scores are precomputed as a dense sector template, which is queried at runtime to convert active sparse voxels into sector-wise ordered sequences.
  • Figure 5: Qualitative comparison on challenging long-range occluded targets. Green boxes denote ground truth, red dashed boxes denote baseline predictions, and blue dashed boxes denote RayMamba predictions.
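The ray-aligned serialization described in Figure 4 can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, grid size, sector count, and the exact in-sector score are assumptions; the paper likewise precomputes a dense sector template offline for a fixed voxel grid and queries it at runtime for active voxels.

```python
import numpy as np

def build_sector_template(grid_xy=8, num_sectors=4):
    """Precompute, for every cell of a fixed BEV grid, an azimuth sector id
    and an in-sector angular score (illustrative analogue of the paper's
    offline-generated dense sector template)."""
    xs, ys = np.meshgrid(np.arange(grid_xy), np.arange(grid_xy), indexing="ij")
    # Cell centers relative to the sensor, assumed at the grid center.
    cx = xs - (grid_xy - 1) / 2.0
    cy = ys - (grid_xy - 1) / 2.0
    azimuth = np.arctan2(cy, cx)                       # in [-pi, pi]
    frac = (azimuth + np.pi) / (2 * np.pi)             # normalized to [0, 1)
    sector_id = np.minimum((frac * num_sectors).astype(np.int64),
                           num_sectors - 1)
    return sector_id, frac                             # dense, shape (H, W)

def ray_aligned_serialize(coords, sector_id, angle_score):
    """Order active voxels sector-wise: group by azimuth sector, traverse
    height layers top-to-bottom, then order by azimuth within each layer.
    coords: (N, 3) int array of (x, y, z) voxel indices."""
    x, y, z = coords[:, 0], coords[:, 1], coords[:, 2]
    sec = sector_id[x, y]          # query the dense template at runtime
    ang = angle_score[x, y]
    # np.lexsort sorts by the LAST key first: sector, then -z (top layer
    # first), then angular position within the layer.
    order = np.lexsort((ang, -z, sec))
    return order, sec

# Toy usage: three active voxels on an 8x8 grid.
sector_id, angle_score = build_sector_template()
coords = np.array([[7, 4, 0], [7, 4, 2], [0, 4, 1]])
order, sec = ray_aligned_serialize(coords, sector_id, angle_score)
# The two voxels at (7, 4) share a sector; the higher one (z=2) comes first,
# and the voxel at (0, 4), lying in a different sector, is serialized last.
```

Because the template depends only on the voxel grid, the runtime cost per frame reduces to a gather plus a lexicographic sort over the active voxels, which is how the method keeps its overhead modest.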