SFMNet: Sparse Focal Modulation for 3D Object Detection

Oren Shrout, Ayellet Tal

TL;DR

SFMNet tackles the challenge of modeling long-range dependencies in 3D LiDAR object detection without incurring the quadratic cost of self-attention. It introduces Sparse Focal Modulation (SFM), a hierarchical, sparse-convolution-based module that aggregates multi-scale context with gating and modulates per-voxel queries, achieving linear complexity in the number of non-empty voxels. By embedding SFM blocks into both the 3D backbone and BEV-enabled 2D backbone, SFMNet attains substantial improvements on long-range detection while maintaining efficient sparse computation, yielding state-of-the-art results on Argoverse2 and competitive performance on Waymo Open and nuScenes. The approach demonstrates that carefully designed sparse context aggregation can emulate transformer-like global reasoning with far lower computational overhead, enabling scalable, accurate 3D detection for large-scale LiDAR scenes.

Abstract

We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. Traditional sparse convolution techniques capture local structures efficiently but struggle to model long-range relationships, which are fundamental for 3D object detection. Transformers, in contrast, are designed to capture these long-range dependencies through attention mechanisms, but they incur high computational costs due to their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. Our SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well-suited for large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.
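
The quadratic-versus-linear contrast drawn in the abstract can be illustrated with a rough count of query interactions. The sketch below is only back-of-the-envelope arithmetic: the function names and the choice of four contexts per query are hypothetical, and the cost of building the contexts (itself linear in the number of voxels for a fixed kernel size) is ignored.

```python
def attention_interactions(num_voxels: int) -> int:
    """Query-key pairs scored when every voxel attends to every voxel."""
    return num_voxels * num_voxels


def focal_interactions(num_voxels: int, num_contexts: int) -> int:
    """Query-context products when each voxel meets a fixed set of contexts."""
    return num_voxels * num_contexts


# num_contexts=4 is an illustrative assumption, not a value from the paper.
for n in (100, 1_000, 10_000):
    print(n, attention_interactions(n), focal_interactions(n, num_contexts=4))
```

Growing the voxel count tenfold multiplies the first count by one hundred but the second only by ten, which is the scaling behaviour the abstract refers to.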

Paper Structure

This paper contains 13 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: SFMNet in comparison with other architectures. Similar to sparse convolution-based detectors, e.g., yin2021center (a), our SFMNet detector (c) comprises a multi-stage 3D backbone with sparse convolution blocks. In contrast, SFMNet is fully sparse, including the 2D backbone, and integrates a novel SFM module (light orange) into both the 3D and 2D backbones. Furthermore, unlike transformers, SFMNet uses sparse convolutions as the token mixer, which makes query modulation efficient. Moreover, unlike the single-stride network design used in transformer-based detectors (e.g., wang2023dsvt), our SFMNet detector employs an encoder-based design.
  • Figure 2: SFM module vs. Local attention. Given a query token (burnt orange), local-based attention (a) identifies and groups non-empty tokens (voxels) within a context window (light gray) to form key and value tokens. The query-key-value interactions have quadratic complexity. In contrast, SFM (b) leverages sparse convolutions to extract multiple levels of context, reducing query interactions to only a few focal contexts and achieving linear complexity with respect to the number of non-empty voxels in the context window. Furthermore, while local-based attention suffers from inefficiencies in batch processing due to the varying numbers of non-empty voxels within each context window, our SFM approach is robust to this variability, extracting a constant number of contexts for each window.
  • Figure 3: Sparse Focal Modulation (SFM). SFM leverages sparse convolutions to aggregate multi-level contexts and modulate them with query features, efficiently capturing both local and global dependencies.
  • Figure 4: SFMNet architecture. (a) The SFMNet architecture extracts voxel features from a point cloud input (VFE), processes them through multiple stages, and converts them to a bird's-eye view (BEV) representation before passing the output to a 2D backbone and detection head. (b) Each stage consists of multiple Sparse Focal Modulation (SFM) blocks followed by Sparse Residual Blocks (SRBs). (c) The SFM block aggregates contextual information and processes it, while the SRB employs SubMConv layers with batch norm (BN) and ReLU activations to capture spatial features.
  • Figure 5: Effective receptive field (ERF) of SFMNet. (a) A typical input scene, with height shown in grayscale. (b,c) The ERF for each ground-truth object is color-coded from blue (outside ERF) to red (high-attention regions). Red and orange rectangles highlight zoomed-in regions. The right (orange) zoom-in shows four stationary cars and a van, plus two moving cars and a van. Without the SFM module (b), only a local portion of each vehicle is within the ERF. With SFM (c), the entire object is captured, providing more context and spatial understanding for accurate 3D detection.
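
The idea described in the captions for Figures 2 and 3 (aggregate several levels of context with sparse convolutions, gate them, and use the result to modulate each voxel's query) can be sketched on a toy voxel dictionary. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform-weight neighbourhood average stands in for a learned submanifold sparse convolution, the 3x3x3 kernel, the sigmoid gating, and all names (`submanifold_mean`, `sparse_focal_modulation`, `w_q`, `w_g`, `w_h`) are hypothetical.

```python
import itertools
import numpy as np

# All 27 offsets of a 3x3x3 neighbourhood (kernel size is an assumption).
OFFSETS = list(itertools.product((-1, 0, 1), repeat=3))


def submanifold_mean(feats):
    """Average each active voxel with its active neighbours.

    Stand-in for a submanifold sparse convolution with uniform weights:
    outputs exist only at voxels that were already active.
    """
    out = {}
    for (x, y, z) in feats:
        neighbours = [feats[(x + dx, y + dy, z + dz)]
                      for dx, dy, dz in OFFSETS
                      if (x + dx, y + dy, z + dz) in feats]
        out[(x, y, z)] = np.mean(neighbours, axis=0)
    return out


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def sparse_focal_modulation(feats, w_q, w_g, w_h, num_levels=2):
    """Modulate per-voxel queries with gated multi-level context."""
    queries = {k: v @ w_q for k, v in feats.items()}
    # One gate per context level and voxel (sigmoid is an assumption).
    gates = {k: sigmoid(v @ w_g) for k, v in feats.items()}
    modulator = {k: np.zeros_like(v) for k, v in feats.items()}
    ctx = feats
    for level in range(num_levels):
        # Stacking the operator widens the receptive field level by level.
        ctx = submanifold_mean(ctx)
        for k in feats:
            modulator[k] = modulator[k] + gates[k][level] * ctx[k]
    # Element-wise product: each query meets one aggregated context vector.
    return {k: queries[k] * (modulator[k] @ w_h) for k in feats}


rng = np.random.default_rng(0)
C = 4  # feature channels (illustrative)
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (9, 9, 9)]
feats = {k: rng.normal(size=C) for k in coords}
w_q, w_h = rng.normal(size=(C, C)), rng.normal(size=(C, C))
w_g = rng.normal(size=(C, 2))
out = sparse_focal_modulation(feats, w_q, w_g, w_h, num_levels=2)
print(sorted(out))  # same active voxels as the input
```

Each level visits every active voxel once with a fixed number of neighbour lookups, so the work grows linearly with the number of non-empty voxels, and every voxel ends up with the same number of contexts regardless of how many neighbours it has.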