What Matters in Range View 3D Object Detection

Benjamin Wilson; Nicholas Autio Mitchell; Jhony Kaesemodel Pontes; James Hays

TL;DR

This work analyzes range-view 3D object detection and shows that a simple, well-tuned range-view model can achieve state-of-the-art results without several of the techniques proposed in prior range-view literature. It identifies four design decisions (input feature dimensionality, 3D input encoding, 3D classification supervision, and range-based subsampling) as the primary levers for accuracy and runtime. The authors introduce Dynamic 3D Centerness, a Gaussian proximity-based supervision signal, and Range Subsampling, which reduces the number of proposals. They evaluate both on Argoverse 2 and Waymo Open and report a clear improvement in small-object detection. The resulting model is open-source and multi-class. It is competitive with voxel-based methods on Argoverse 2 and sets a new state of the art among range-view models on Waymo Open while running at about 10 Hz. These findings suggest that simple, principled range-view techniques can match or exceed more complex approaches, and they point future range-view research toward efficient, scalable designs.
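To make the idea of a Gaussian proximity-based supervision signal concrete, the following is a minimal sketch, not the authors' implementation. It assumes the soft classification target decays as a Gaussian of the 3D distance between a decoded proposal center and the ground truth center. The function name `gaussian_centerness`, the `radius` parameter, and the exact formula are illustrative assumptions; the summary does not state how the radius is computed.

```python
import math


def gaussian_centerness(pred_center, gt_center, radius):
    """Soft classification target in (0, 1] from 3D center distance.

    Assumption (not from the paper): target = exp(-d^2 / (2 * radius^2)),
    where d is the Euclidean distance between the two centers and
    `radius` is a hypothetical per-object scale.
    """
    squared_distance = sum((p - g) ** 2 for p, g in zip(pred_center, gt_center))
    return math.exp(-squared_distance / (2.0 * radius ** 2))


# Hypothetical ground truth center and three decoded proposal centers.
gt = (10.0, 2.0, 0.5)
perfect = gaussian_centerness((10.0, 2.0, 0.5), gt, radius=1.0)
near = gaussian_centerness((10.5, 2.0, 0.5), gt, radius=1.0)
far = gaussian_centerness((15.0, 2.0, 0.5), gt, radius=1.0)
```

Under this assumption the target stays strictly positive and ordered by distance even for a proposal that does not overlap the ground truth box at all, which is the case where an IoU-based target would be zero and give no ranking signal.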

Abstract

Lidar-based perception pipelines rely on 3D object detection models to interpret complex scenes. While multiple representations for lidar exist, the range view is enticing because it losslessly encodes the entire lidar sensor output. In this work, we achieve state-of-the-art results among range-view 3D object detection models without using multiple techniques proposed in past range-view literature. We explore range-view 3D object detection across two modern datasets with substantially different properties: Argoverse 2 and Waymo Open. Our investigation reveals key insights: (1) input feature dimensionality significantly influences overall performance, (2) surprisingly, a classification loss grounded in 3D spatial proximity works as well as or better than more elaborate IoU-based losses, and (3) addressing non-uniform lidar density via a straightforward range subsampling technique outperforms existing multi-resolution, range-conditioned networks. Our experiments reveal that techniques proposed in recent range-view literature are not needed to achieve state-of-the-art performance. Combining the above findings, we establish a new state-of-the-art model for range-view 3D object detection, improving AP by 2.2% on the Waymo Open dataset while maintaining a runtime of 10 Hz. We establish the first range-view model on the Argoverse 2 dataset and outperform strong voxel-based baselines. All models are multi-class and open-source. Code is available at https://github.com/benjaminrwilson/range-view-3d-detection.
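Insight (3) can be illustrated with a small sketch. It is not the paper's algorithm: it assumes that proposals are grouped into range bands and that nearer bands, where lidar returns are denser, are thinned more aggressively before non-maximum suppression. The function name `range_subsample`, the band boundaries, and the strides are all hypothetical.

```python
def range_subsample(proposals, bands):
    """Keep every `stride`-th proposal inside each range band.

    proposals: list of (range_in_meters, score) tuples in scan order.
    bands: list of (min_range, max_range, stride) tuples. The values
           used below are hypothetical, not taken from the paper.
    """
    kept = []
    for min_range, max_range, stride in bands:
        in_band = [p for p in proposals if min_range <= p[0] < max_range]
        kept.extend(in_band[::stride])
    return kept


# Hypothetical bands: thin dense near-range proposals, keep all far ones.
bands = [(0.0, 30.0, 4), (30.0, 50.0, 2), (50.0, float("inf"), 1)]

# Toy proposals: one per meter from 0 m to 79 m, all with the same score.
proposals = [(float(r), 0.5) for r in range(80)]
kept = range_subsample(proposals, bands)
```

The design point this is meant to show is that fewer proposals reach non-maximum suppression while sparse, distant objects keep all of theirs.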

Paper Structure (38 sections, 11 equations, 7 figures, 10 tables)

Introduction
Related Work
Point-based Methods.
Grid-based Projections.
Range-based Projections.
Multi-View Projections.
Range-view 3D Object Detection
Inputs: Range View Features.
3D Object Detection.
3D Input Encoding.
Scaling Input Feature Dimensionality.
Dynamic 3D Centerness.
Range Subsampling.
Experiments
Datasets
...and 23 more sections

Figures (7)

Figure 1: Range View Representation. We illustrate the connection between the range view representation (top) and an "over-the-shoulder" view of a 3D scene (bottom). The range view representation encodes large 3D scenes recorded by a rotating lidar sensor into a compact image which can be directly processed by CNN-based architectures. We show a building in both views in the top left (enclosed in black boxes). Warmer colors indicate points closer to the lidar sensor, while cooler colors represent points distant from the sensor.
Figure 2: Network Inputs: Range View Features. The input to our network for the Waymo Open dataset consists of auxiliary features (elongation and intensity) and geometric features (range, x, y, z). Each channel is re-mapped to represent warmer colors as the smallest values and cooler colors as the largest values within their respective domains. White pixels indicate invalid returns.
Figure 3: 3D Object Detection in the Range View. We show a range image, the object confidences from a network, and their corresponding 3D cuboids shown in the bird's-eye view for a scene with multiple parked vehicles. For each visible point in the range image, our range-view 3D object detection model learns (1) which category an object belongs to and (2) the offset from the visible point to the center of the object, its 3D size, and its orientation. In the above example, we show one particular point (Point A) from two different perspectives: the range view and the bird's-eye view. Blue boxes indicate the ground truth cuboids, green boxes indicate true positives, and red boxes indicate false positives. Importantly, each object can have many thousands of proposals; however, most will be removed through non-maximum suppression.
Figure 4: Model Architecture. We explore a variety of design decisions in range-view-based 3D object detection models. Our overall framework is shown above. Range view features are processed by a 3D input encoding which modulates features by their proximity in 3D space. These features are subsequently passed to a backbone CNN for feature extraction and sharing. The classification and regression heads process these features and produce classification likelihoods and object regression parameters, respectively. The regression parameters are compared with their ground truth target assignments to produce classification targets which incorporate regression quality. The classification scores and decoded bounding boxes are subsampled by our Range Subsampling method and then finally clustered via non-maximum suppression to produce the final set of likelihoods and scores. Blue boxes indicate core components in the network and boxes outlined in black indicate components which we explicitly ablate and explore.
Figure 5: Dynamic 3D Classification Supervision. We decode object proposals at each 3D point in a range image during training in order to rank them and compute a soft classification target $t_i$. In the above example, we show two object points, $p_1$ (red) and $p_2$ (blue), their corresponding proposals decoded from the network (color-coded), the soft targets $t_1$ and $t_2$, and the radii computed for Dynamic 3D Centerness, $r_1$ and $r_2$. We illustrate the differences between IoU-based (left) and our proposed Dynamic 3D Centerness (right) rankings. IoU-based metrics are sensitive to translation error and provide no signal when there is no overlap between the decoded proposal and the ground truth object. Dynamic 3D Centerness does not suffer from the same problem.
...and 2 more figures
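
The captions for Figures 1 and 2 describe the range view as a compact image whose channels include range, x, y, and z, with invalid returns left empty. As a rough illustration of that layout, the sketch below builds such an image with a generic spherical projection of a point cloud. This is an assumption for illustration only: the datasets provide range images derived from the sensor itself, and a projection like this one can drop points when two returns land in the same pixel. The function name `project_to_range_view`, the image size, and the field-of-view values are hypothetical.

```python
import math


def project_to_range_view(points, height, width, fov_up_deg, fov_down_deg):
    """Project (x, y, z) points into a height x width range image.

    Each pixel holds (range, x, y, z), or None for an invalid return.
    Rows index inclination (top row = fov_up); columns index azimuth.
    """
    fov_up = math.radians(fov_up_deg)
    fov = fov_up - math.radians(fov_down_deg)
    image = [[None] * width for _ in range(height)]
    for x, y, z in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue  # a point at the sensor origin has no direction
        azimuth = math.atan2(y, x)       # in [-pi, pi]
        inclination = math.asin(z / rng)
        col = math.floor(0.5 * (1.0 - azimuth / math.pi) * width)
        row = math.floor((fov_up - inclination) / fov * height)
        col = min(max(col, 0), width - 1)  # azimuth = -pi maps to `width`
        if not 0 <= row < height:
            continue  # outside the vertical field of view
        # On a collision, keep the closest return (one simple policy).
        if image[row][col] is None or rng < image[row][col][0]:
            image[row][col] = (rng, x, y, z)
    return image


# Toy point cloud: two points share a pixel, one lies at a different azimuth.
points = [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (5.0, 0.0, 0.0)]
image = project_to_range_view(points, height=4, width=8,
                              fov_up_deg=10.0, fov_down_deg=-10.0)
valid_pixels = sum(1 for row in image for px in row if px is not None)
```

In the toy input, the points at 10 m and 5 m along the x-axis fall into the same pixel and only the closer one is kept, which shows why this generic projection is not a substitute for the lossless sensor-native range image the abstract refers to.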
