Towards Category Unification of 3D Single Object Tracking on Point Clouds

Jiahao Nie, Zhiwei He, Xudong Lv, Xueyi Zhou, Dong-Kyu Chae, Fei Xie

TL;DR

This paper introduces unified models that track objects across all categories simultaneously using a single network with shared parameters. The models explicitly encode the distinct attributes associated with different object categories, which lets them adapt to cross-category data.

Abstract

Category-specific models have proven valuable in 3D single object tracking (SOT), in both the Siamese and motion-centric paradigms. However, such over-specialized model designs incur redundant parameters, limiting the broader applicability of the 3D SOT task. This paper introduces, for the first time, unified models that can simultaneously track objects across all categories using a single network with shared model parameters. Specifically, we propose to explicitly encode the distinct attributes associated with different object categories, enabling the model to adapt to cross-category data. We find that the attribute variances of point cloud objects arise primarily from varying size and shape (e.g., large and square vehicles vs. small and slender humans). Based on this observation, we design a novel point set representation learning network built on the transformer architecture, termed AdaFormer, which adaptively encodes the dynamically varying shape and size information from cross-category data in a unified manner. We further incorporate the size and shape prior derived from the known template targets into the model's inputs and learning objective, facilitating the learning of a unified representation. Equipped with these designs, we construct two category-unified models, SiamCUT and MoCUT. Extensive experiments demonstrate that SiamCUT and MoCUT exhibit strong generalization and training stability. Furthermore, our category-unified models outperform their category-specific counterparts by a significant margin (e.g., on the KITTI dataset, 12% and 3% performance gains on the Siamese and motion paradigms, respectively). Our code will be available.

Paper Structure (23 sections, 11 equations, 8 figures, 10 tables)

This paper contains 23 sections, 11 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison between different tracking models. In previous category-specific models (a), multiple networks are required to perform individual tracking task for each category. In contrast, our category-unified models (b) can simultaneously track objects across all categories using a single network with shared parameters.
  • Figure 2: Overall architecture of AdaFormer. The proposed unified representation network shares a similar three-stage hierarchical structure with PointNet++, the existing point set network used in 3D SOT, and consists of a series of subsample operators and AdaFormer blocks. Our representation network learns dynamic groups through a deformable group vector-attention sub-block, thereby enabling a variable range of receptive fields.
  • Figure 3: Comparison between different search regions, i.e., model inputs. Previous methods (a) generate search regions by expanding the predicted result of the previous frame by a fixed 3D distance, while our method (b) expands it by a scale relative to the width, height and length of the target object.
  • Figure 4: Generalization and stability comparisons with and without the proposed unified components. We plot "performance vs. epoch" curves on four categories, and include error bands calculated by running the corresponding experiments three times with different random seeds.
  • Figure 5: Numerical statistics of object's length, width and relative motion ($\Delta x$, $\Delta y$, $\Delta z$, $\Delta \theta$) of two adjacent frames on Car and Pedestrian categories from KITTI dataset.
  • ...and 3 more figures
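
The contrast drawn in Figure 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `crop_search_region`, the axis-aligned box (heading is ignored), and the example values for `fixed_offset` and `scale` are all assumptions made here for illustration. It only shows how a fixed 3D margin and a size-proportional margin select different points for a small target.

```python
import numpy as np

def crop_search_region(points, center, size, *, fixed_offset=None, scale=None):
    """Return the points inside an axis-aligned search box around `center`.

    Hypothetical helper, not from the paper. `size` is the target's
    (length, width, height) from the previous frame. Pass exactly one of:
      fixed_offset: margin added to every half-extent (fixed 3D distance)
      scale:        multiplier applied to every half-extent (size-aware)
    Box rotation is ignored to keep the sketch short.
    """
    if (fixed_offset is None) == (scale is None):
        raise ValueError("pass exactly one of fixed_offset or scale")
    half = np.asarray(size, dtype=float) / 2.0
    if fixed_offset is not None:
        half_extent = half + fixed_offset   # same margin for every category
    else:
        half_extent = half * scale          # margin grows with object size
    offsets = np.abs(points - np.asarray(center, dtype=float))
    mask = np.all(offsets <= half_extent, axis=1)
    return points[mask]

# Illustrative sizes (length, width, height) in metres; assumed values.
pedestrian = (0.8, 0.6, 1.8)
car = (4.0, 1.6, 1.5)

# Three points along the x axis, 0 m, 1 m and 3 m from the previous center.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
origin = np.zeros(3)

ped_fixed = crop_search_region(pts, origin, pedestrian, fixed_offset=2.0)
ped_scaled = crop_search_region(pts, origin, pedestrian, scale=2.0)
car_scaled = crop_search_region(pts, origin, car, scale=2.0)
```

With these assumed numbers, the fixed margin keeps the point 1 m away from the pedestrian while the scaled box drops it, and the same scale factor still gives the car a box wide enough to keep all three points. A single fixed margin therefore yields very different object-to-background ratios across categories, which is the mismatch the size-aware search region is meant to reduce.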