Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

Yiding Sun, Jihua Zhu, Haozhe Cheng, Chaoyi Lu, Zhichuan Yang, Lin Chen, Yaonan Wang

TL;DR

The proposed PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work, and extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency.

Abstract

Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21\% accuracy on 3D action recognition, $+8.7\%$ on 4D action segmentation, and 84.06\% on 4D semantic segmentation.
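
The abstract says optimal transport is used to quantify the discrepancy between 3D and 4D feature distributions, but this page does not give the exact formulation. As a rough illustration only, the sketch below computes an entropic-regularized transport cost with Sinkhorn iterations between two small feature sets. Sinkhorn is a stand-in chosen for simplicity, and the function name `sinkhorn_cost`, the parameters `eps` and `n_iters`, and the toy feature values are all assumptions, not taken from the paper.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sinkhorn_cost(source, target, eps=0.5, n_iters=200):
    """Entropic-regularized OT cost between two point sets with uniform weights.

    Hypothetical helper for illustration; `eps` is the regularization strength.
    Very large distances relative to `eps` can underflow exp() to zero, so this
    plain-Python version is only suitable for small, well-scaled toy inputs.
    """
    n, m = len(source), len(target)
    cost = [[sq_dist(s, t) for t in target] for s in source]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on the source (3D) features
    b = [1.0 / m] * m  # uniform mass on the target (4D) features
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan


# Toy features (made up): one 4D set close to the 3D set, one far from it.
feats_3d = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feats_4d_near = [[0.1, 0.0], [1.0, 0.1], [0.0, 0.9]]
feats_4d_far = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0]]

cost_near, plan_near = sinkhorn_cost(feats_3d, feats_4d_near)
cost_far, _ = sinkhorn_cost(feats_3d, feats_4d_far)
print(cost_near < cost_far)
```

A smaller cost indicates that the two feature sets are distributed more similarly, which is the kind of quantity a Stage 1 alignment objective would try to reduce.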

Paper Structure (20 sections, 4 equations, 10 figures, 10 tables, 1 algorithm)

This paper contains 20 sections, 4 equations, 10 figures, 10 tables, 1 algorithm.

Figures (10)

  • Figure 1: Current 4D PETL methods face two limits. Upper panel: Adapters are feasible (Smiley), yet current methods expose models to severe overfitting (Crying). Lower panel: Cross-modal transfer needs prior alignment. This practice is mature in 2D vision and NLP, but to our knowledge no 4D PETL study measures this gap (Crying). Ignoring the gap will undoubtedly hurt downstream performance (Crying).
  • Figure 2: Comparison of 3D full fine-tuning, 4D full fine-tuning, 4D adapter tuning, and our PointATA. Because the embedder required for 4D encoding is often heavier than its 3D counterpart, a 4D dynamic perception task can require updating even more parameters than a full 3D fine-tuning update (over 100% of it). PointATA saves substantial resources and time compared with full fine-tuning and significantly boosts reuse of 3D pre-trained models. It also exploits 3D priors better than 4D adapter tuning and further cuts parameters to curb overfitting.
  • Figure 3: PointATA employs a two-stage workflow to quickly adapt large 3D pre-trained models to diverse 4D downstream tasks. In Stage 1, it obtains 4D features via the P4D embedder and trains the embedder by minimizing the distribution distance to 3D source features. The weights of the P4D embedder are randomly initialized. In Stage 2, it jointly fine-tunes the P4D embedder and the PVA to minimize the task loss. The 3D backbone remains frozen throughout.
  • Figure 4: Visualization of action segmentation. P4Transformer has a serious over-segmentation problem.
  • Figure 5: 4D semantic segmentation visualization on the Synthia 4D dataset. Key points are demarcated by red dashed circular bounding boxes.
  • ...and 5 more figures
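
The two-stage schedule described in the Figure 3 caption can be summarized as a freeze/unfreeze plan. The sketch below is only a bookkeeping illustration of which components receive updates in each stage. The component names, the helper functions, and especially the parameter counts are hypothetical placeholders, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    n_params: int  # hypothetical size, NOT taken from the paper
    trainable: bool = False


# Which components are updated in each stage, following the Figure 3 caption:
# Stage 1 trains the P4D embedder, Stage 2 trains the embedder and the PVA.
STAGE_TRAINABLE = {
    1: {"p4d_embedder"},
    2: {"p4d_embedder", "pva"},
}


def set_stage(components, stage):
    """Mark which components receive gradient updates in the given stage."""
    names = STAGE_TRAINABLE[stage]
    for c in components:
        c.trainable = c.name in names


def trainable_fraction(components):
    """Share of all parameters that are updated under the current flags."""
    total = sum(c.n_params for c in components)
    tuned = sum(c.n_params for c in components if c.trainable)
    return tuned / total


model = [
    Component("p4d_embedder", 300_000),
    Component("backbone_3d", 22_000_000),  # frozen in both stages
    Component("pva", 700_000),
]

for stage in (1, 2):
    set_stage(model, stage)
    updated = [c.name for c in model if c.trainable]
    print(f"Stage {stage}: update {updated}, "
          f"fraction = {trainable_fraction(model):.2%}")
```

Because the backbone never appears in `STAGE_TRAINABLE`, it stays frozen in both stages, which mirrors the caption's statement that the 3D backbone remains frozen throughout.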