RegTrack: Simplicity Beneath Complexity in Robust Multi-Modal 3D Multi-Object Tracking

Lipeng Gu, Xuefeng Yan, Song Wang, Mingqiang Wei

TL;DR

This work proposes a robust, efficient, and generalizable method for multi-modal 3D MOT, dubbed RegTrack, built upon a unified tri-cue encoder, comprising three tightly coupled components: a local-global point cloud encoder, a mixture-of-experts-based geometry encoder, and an image encoder from a well-pretrained visual-language model.

Abstract

Existing 3D multi-object tracking (MOT) methods often sacrifice efficiency and generalizability for robustness, largely relying on complex association metrics derived from multi-modal architectures and class-specific motion priors. Challenging the deep-rooted belief that greater complexity necessarily yields greater robustness, we propose a robust, efficient, and generalizable method for multi-modal 3D MOT, dubbed RegTrack. Inspired by Yang-Mills gauge theory, RegTrack is built upon a unified tri-cue encoder (UTEnc), comprising three tightly coupled components: a local-global point cloud encoder (LG-PEnc), a mixture-of-experts-based geometry encoder (MoE-GEnc), and an image encoder from a well-pretrained visual-language model. LG-PEnc efficiently encodes the spatial and structural information of point clouds to produce foundational representations for each object, whose pairwise similarities serve as the sole association metric. MoE-GEnc seamlessly interacts with LG-PEnc to model inter-object geometric relationships across frames, adaptively compensating for inter-frame object motion without relying on any class-specific priors. The image encoder is kept frozen and is used exclusively during training to provide a well-pretrained representation space. Point cloud representations are aligned to this space to supervise the motion compensation process, encouraging representation invariance across frames for the same object while enhancing discriminability among different objects. Through this formulation, RegTrack attains robust, efficient, and generalizable inference using only point cloud inputs, requiring just 2.6M parameters. Extensive experiments on KITTI and nuScenes show that RegTrack outperforms its thirty-five competitors.
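
To make the idea of using pairwise representation similarity as the sole association metric more concrete, the following is a minimal sketch and not the authors' implementation. It assumes cosine similarity, a greedy one-to-one matching, and a fixed threshold of 0.5; the function names `cosine_similarity` and `associate` are hypothetical, and the paper may use a different similarity measure, matching strategy, or threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(track_reps, det_reps, threshold=0.5):
    """Greedily match tracks to detections by descending similarity.

    track_reps / det_reps: lists of object representations (lists of floats).
    threshold: hypothetical fixed cut-off; pairs below it are never matched.
    Returns (matches, unmatched_tracks, unmatched_dets).
    """
    # Collect every track/detection pair whose similarity passes the threshold.
    candidates = []
    for i, t in enumerate(track_reps):
        for j, d in enumerate(det_reps):
            s = cosine_similarity(t, d)
            if s >= threshold:
                candidates.append((s, i, j))

    # Most similar pairs are matched first; each index is used at most once.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used_tracks, used_dets, matches = set(), set(), []
    for _, i, j in candidates:
        if i in used_tracks or j in used_dets:
            continue
        used_tracks.add(i)
        used_dets.add(j)
        matches.append((i, j))

    unmatched_tracks = [i for i in range(len(track_reps)) if i not in used_tracks]
    unmatched_dets = [j for j in range(len(det_reps)) if j not in used_dets]
    return matches, unmatched_tracks, unmatched_dets
```

In a tracker built this way, unmatched detections would typically start new tracks and unmatched tracks would age toward deletion, which is the role the lifecycle management module plays in the described framework.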

Paper Structure (33 sections, 14 equations, 11 figures, 14 tables)

Figures (11)

  • Figure 1: Comparison between (a) existing methods (JRMOT, JMODT, FANTrack, mmMOT, BcMODT, MMF-JDT) and (b) our RegTrack. i) Training: RegTrack leverages the representation space of a well-pretrained CLIP image encoder to supervise the joint learning of the point cloud and geometry encoders, thereby facilitating the learning of motion-compensated point cloud representations. ii) Inference: RegTrack employs the point cloud and geometry encoders to encode point cloud inputs, yielding object representations. Based on these representations, it constructs a fixed-threshold association metric, achieving superior robustness, efficiency, and generalizability compared with existing methods that rely on intricate multi-modal architectures and geometric constraints driven by class-specific motion priors.
  • Figure 2: Motivation and inference framework of RegTrack. Inspired by Yang–Mills gauge theory, RegTrack employs a geometry encoder to model geometric cues for adaptive motion compensation under the supervision of a CLIP image representation space during training. Consequently, the point cloud encoder learns cross-frame invariant object representations for constructing a robust association metric. During inference, RegTrack relies only on the point cloud encoder (LG-PEnc) and the geometry encoder (MoE-GEnc), without requiring the image encoder. Detections are preprocessed by cropping point cloud patches and resampling each patch to $K$ points, while trajectories at frame $t-1$ are propagated to frame $t$ via 3D Kalman filter (KF) prediction (see Fig. 3). The lifecycle management module handles track birth, update, and death.
  • Figure 3: Pipeline of 3D Kalman filter prediction. Taking an object from a trajectory at frame $t-1$ as an example, the object is cropped into a point cloud patch and resampled to $K$ points. The patch and its 3D bounding box are transformed into the global coordinate frame and propagated to frame $t$ via a 3D Kalman filter prediction. The predicted patch and bounding box are then transformed into the LiDAR coordinate frame at frame $t$.
  • Figure 4: Training pipeline of UTEnc. UTEnc consists of three sub-modules: a local–global point cloud encoder (LG-PEnc), a mixture-of-experts-based geometry encoder (MoE-GEnc), and a well-pretrained CLIP image encoder. LG-PEnc encodes point cloud objects to produce object representations $\mathcal{P}$. MoE-GEnc then performs adaptive motion compensation on a pair of inter-frame point cloud representations $\mathcal{P}_{t-1}$ and $\mathcal{P}_{t}$, guided by the composite routing (CR) loss, yielding motion-compensated representations $\mathcal{P}^{\mathcal{G}}_{t-1}$ and $\mathcal{P}^{\mathcal{G}}_{t}$. The image encoder is frozen and used only during training to provide a globally invariant reference space that supervises the compensation process via the tri-cue unification (TU) loss.
  • Figure 5: Overview of LG-PEnc. It takes a point cloud patch $\mathcal{P}_{in} \in \mathbb R^{N \times c}$ and extracts global $\mathcal{P}_g \in \mathbb R^{1 \times 1024}$ and local $\mathcal{P}_l \in \mathbb R^{1 \times 1024}$ features. The two features are aggregated using a learnable parameter $\alpha \in \mathbb R^{1 \times 1024}$ and a 1D convolution to output the object representation $\mathcal{P} \in \mathbb R^{1 \times 512}$.
  • ...and 6 more figures
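
The captions of Figures 2 and 3 mention that each cropped point cloud patch is resampled to $K$ points before encoding. The sketch below illustrates one way such a step could look; it is an assumption, not the paper's procedure. It uses uniform random sampling without replacement when the patch has at least $K$ points, and pads by repeating randomly chosen points otherwise. The function name `resample_patch` and the padding strategy are hypothetical.

```python
import random


def resample_patch(points, k, rng=None):
    """Return exactly k points drawn from a point cloud patch.

    points: non-empty list of points, e.g. (x, y, z) tuples.
    k: target number of points per patch.
    rng: optional random.Random instance for reproducibility.
    """
    if not points:
        raise ValueError("cannot resample an empty patch")
    if k <= 0:
        raise ValueError("k must be positive")
    rng = rng or random.Random()
    n = len(points)
    if n >= k:
        # Enough points: subsample k distinct indices.
        indices = rng.sample(range(n), k)
    else:
        # Too few points: keep all of them and pad with random repeats.
        indices = list(range(n)) + [rng.randrange(n) for _ in range(k - n)]
    return [points[i] for i in indices]
```

A fixed point count per object keeps the encoder input shape constant regardless of how many LiDAR returns fall on an object, which is presumably why the preprocessing step is described this way.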