SFMNet: Sparse Focal Modulation for 3D Object Detection
Oren Shrout, Ayellet Tal
TL;DR
SFMNet tackles the challenge of modeling long-range dependencies in 3D LiDAR object detection without incurring the quadratic cost of self-attention. It introduces Sparse Focal Modulation (SFM), a hierarchical, sparse-convolution-based module that aggregates multi-scale context with gating and modulates per-voxel queries, achieving linear complexity in the number of non-empty voxels. By embedding SFM blocks into both the 3D backbone and BEV-enabled 2D backbone, SFMNet attains substantial improvements on long-range detection while maintaining efficient sparse computation, yielding state-of-the-art results on Argoverse2 and competitive performance on Waymo Open and nuScenes. The approach demonstrates that carefully designed sparse context aggregation can emulate transformer-like global reasoning with far lower computational overhead, enabling scalable, accurate 3D detection for large-scale LiDAR scenes.
Abstract
We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. Traditional sparse convolutions capture local structures efficiently but struggle to model long-range relationships, which are fundamental for 3D object detection. Transformers, in contrast, are designed to capture such dependencies through attention mechanisms, but they incur high computational costs due to their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well suited for large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.
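To make the idea of gated, hierarchical context aggregation concrete, the following is a minimal dense sketch of generic focal modulation on a 1D token sequence. It is not the SFM module itself: the sparse 3D convolutions are replaced by simple zero-padded window averages, and the function name `focal_modulation_1d`, the weight matrices `w_q`, `w_z`, `w_g`, `w_h`, and the kernel sizes are all assumptions chosen for illustration. The sketch only shows why the cost grows linearly with the number of tokens: each token is touched once per level, and no pairwise token-to-token interaction is computed.

```python
import numpy as np

def focal_modulation_1d(x, w_q, w_z, w_g, w_h, kernel_sizes=(3, 5, 7)):
    """Simplified dense focal modulation (illustrative, not the paper's SFM).

    x:   (N, C) token features
    w_q: (C, C) query projection
    w_z: (C, C) context projection
    w_g: (C, L + 1) gate projection, one gate per local level plus a global one
    w_h: (C, C) projection applied to the aggregated context
    kernel_sizes: odd window sizes, one per local level (L = len(kernel_sizes))
    """
    n, _ = x.shape
    q = x @ w_q        # per-token query
    z = x @ w_z        # initial context features
    gates = x @ w_g    # (N, L + 1) per-token, per-level gates
    ctx = np.zeros_like(z)

    for level, k in enumerate(kernel_sizes):
        pad = k // 2
        zp = np.pad(z, ((pad, pad), (0, 0)))  # zero padding at the borders
        # Window average as a stand-in for a learned depthwise convolution.
        z = np.stack([zp[i:i + k].mean(axis=0) for i in range(n)])
        z = np.maximum(z, 0.0)  # ReLU as a stand-in nonlinearity
        ctx += gates[:, level:level + 1] * z  # gated accumulation per level

    # Global level: average over all tokens, gated per token.
    ctx += gates[:, -1:] * z.mean(axis=0, keepdims=True)

    # Modulate each query element-wise with its aggregated context.
    return q * (ctx @ w_h)
```

Because each level reuses the output of the previous one, the effective receptive field grows with depth while the work per level stays proportional to N times the window size. In a sparse 3D setting, N would correspond to the number of non-empty voxels.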
