
On the Feasibility and Opportunity of Autoregressive 3D Object Detection

Zanming Huang, Jinsu Yoo, Sooyoung Jeon, Zhenzhen Liu, Mark Campbell, Kilian Q Weinberger, Bharath Hariharan, Wei-Lun Chao, Katie Z Luo

TL;DR

AutoReg3D is an autoregressive 3D detector that casts detection as sequence generation and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class, enabling straightforward teacher forcing during training and autoregressive decoding at test time.

Abstract

LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class. This near-to-far ordering mirrors LiDAR geometry (near objects occlude far ones, but not vice versa), enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible with diverse point-cloud backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.
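
To make the sequence formulation concrete, the sketch below shows one way a scene could be turned into a near-to-far token sequence and back. It is an illustration under stated assumptions, not the authors' implementation: the bin count, the value ranges, the token layout (nine quantized attributes followed by a class token), the end-of-sequence token, and the use of bird's-eye-view distance as the "range" are all hypothetical choices, as are the names `Box3D`, `encode_scene`, and `decode_scene`.

```python
import math
from dataclasses import dataclass

# All constants below are assumptions for illustration; the abstract does not specify them.
NUM_BINS = 256                      # quantization bins per continuous attribute
FIELD_RANGES = [                    # (attribute name, lower bound, upper bound)
    ("x", -54.0, 54.0), ("y", -54.0, 54.0), ("z", -5.0, 3.0),
    ("length", 0.0, 20.0), ("width", 0.0, 20.0), ("height", 0.0, 20.0),
    ("yaw", -math.pi, math.pi),
    ("vx", -20.0, 20.0), ("vy", -20.0, 20.0),
]
NUM_CLASSES = 10
CLASS_OFFSET = NUM_BINS             # class tokens live after the bin tokens
EOS = CLASS_OFFSET + NUM_CLASSES    # hypothetical end-of-sequence token
TOKENS_PER_BOX = len(FIELD_RANGES) + 1


@dataclass
class Box3D:
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw: float
    vx: float
    vy: float
    label: int


def quantize(value, lo, hi, bins=NUM_BINS):
    """Map a continuous value to a bin index in [0, bins - 1]."""
    clipped = min(max(value, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * bins), bins - 1)


def dequantize(token, lo, hi, bins=NUM_BINS):
    """Map a bin index back to the center of its bin."""
    return lo + (token + 0.5) * (hi - lo) / bins


def encode_scene(boxes):
    """Sort boxes near-to-far (BEV distance to the sensor) and flatten to tokens."""
    ordered = sorted(boxes, key=lambda b: math.hypot(b.x, b.y))
    tokens = []
    for box in ordered:
        tokens.extend(quantize(getattr(box, name), lo, hi)
                      for name, lo, hi in FIELD_RANGES)
        tokens.append(CLASS_OFFSET + box.label)
    tokens.append(EOS)
    return tokens


def decode_scene(tokens):
    """Read fixed-length token groups until EOS and detokenize each into a box."""
    boxes, i = [], 0
    while i + TOKENS_PER_BOX <= len(tokens) and tokens[i] != EOS:
        chunk = tokens[i:i + TOKENS_PER_BOX]
        values = {name: dequantize(tok, lo, hi)
                  for (name, lo, hi), tok in zip(FIELD_RANGES, chunk)}
        boxes.append(Box3D(**values, label=chunk[-1] - CLASS_OFFSET))
        i += TOKENS_PER_BOX
    return boxes
```

Under this layout, a ground-truth sequence from `encode_scene` could serve as the teacher-forcing target during training, while `decode_scene` would be applied to the tokens sampled at test time; the reconstruction error of each attribute is bounded by half a bin width.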

Paper Structure (49 sections, 4 equations, 14 figures, 12 tables, 2 algorithms)

Figures (14)

  • Figure 1: Autoregressive Object Detection for 3D. Our work proposes a 3D object detector that leverages a sequential generation representation (a). This eliminates many of the complications associated with the rigid detection pipeline, including anchor assignment, confidence thresholding, and NMS (b).
  • Figure 2: Model Architecture. We use an encoder-decoder architecture: a point cloud encoder extracts features, and a causal Transformer decoder generates tokenized bounding boxes. We detokenize the generated sequence to obtain the final set of 3D object detections. This design is compatible with a variety of point cloud encoders, including pillar-based, voxel-convolution, Transformer, and Mamba backbones.
  • Figure 3: Precision-Recall Plot. We plot the PR curves for the baseline methods and mark the precision-recall point of our autoregressive decoder with a star. Top left: Pillar-based backbone. Top right: Voxel-based backbone. Bottom left: Transformer-based backbone. Bottom right: Mamba-based backbone. We observe that the precision-recall point of AutoReg3D consistently lies on or outside the PR curves of models with the same backbone.
  • Figure 4: Qualitative Results. (a) Bounding box generations from our method across four different encoder backbones; (b) Cascading Refinement visualization with ground-truth boxes (left, outlined in green), predictions from the near-to-far prior model (middle), and resulting predictions (right). Cascading Refinement recovers objects missed by the prior model (circled in red). (c) Failure case example. AutoReg3D generates boxes from first (magenta) to last (blue); ground-truth boxes are in gray. Best viewed in color.
  • Figure A1: Cascading Refinement Methodology.
  • ...and 9 more figures