MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion

Lehong Wu, Lilang Lin, Jiahang Zhang, Yiyang Ma, Jiaying Liu

TL;DR

The proposed Masked Conditional Diffusion (MacDiff) achieves state-of-the-art performance on representation learning benchmarks while remaining competent at generative tasks. It also leverages the diffusion model for data augmentation, significantly enhancing fine-tuning performance when labeled data is scarce.

Abstract

Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning, which suffers from false-negative problems, or are based on reconstruction, which learns too many inessential low-level cues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task for modeling the general underlying data distributions. However, the representation learning capacity of generative models is under-explored, especially for skeletons, which exhibit spatial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to the encoder inputs to introduce an information bottleneck and remove the redundancy of skeletons. Furthermore, we theoretically demonstrate that our generative objective involves a contrastive learning objective that aligns the masked and noisy views. Meanwhile, it also forces the representation to complement the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while remaining competent at generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing fine-tuning performance in scenarios with scarce labeled data. Our project is available at https://lehongwu.github.io/ECCV24MacDiff/.

Paper Structure (14 sections, 1 theorem, 11 equations, 1 figure, 10 tables)

This paper contains 14 sections, 1 theorem, 11 equations, 1 figure, 10 tables.

Key Result

Theorem 1

(Bayes Error Rate of Representations) For an arbitrary data representation distribution $Z$, where $V$ denotes a certain view of the data, the Bayes error rate of $Z$ can be estimated as:

Figures (1)

  • Figure 1: Overview of the proposed method. We train a diffusion decoder conditioned on the representations extracted by a semantic encoder. In the upper stream, we embed the input skeletons into tokens and apply random masking. The global representation is obtained by pooling the local representations extracted by the semantic encoder. In the lower stream, we train a conditional diffusion model. We sample the noisy skeleton $\boldsymbol{x}_t$ following the diffusion process $q(\boldsymbol{x}_t|\boldsymbol{x}_0)$. The diffusion decoder predicts the noise $\epsilon$ from $\boldsymbol{x}_t$, guided by the learned representation $\boldsymbol{z}$. The pre-trained encoder can be used independently in downstream discriminative tasks.
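
The two streams in the caption can be illustrated with a minimal, standard-library Python sketch of one training step. It is not the authors' implementation. It assumes the usual DDPM forward process $q(\boldsymbol{x}_t|\boldsymbol{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\boldsymbol{x}_0, (1-\bar{\alpha}_t)\boldsymbol{I})$ with a linear beta schedule. The encoder is replaced by an identity map followed by mean pooling, and the decoder by a hypothetical elementwise linear map with no time embedding. All function names, the 0.9 mask ratio, and the token shapes are illustrative assumptions.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(1, num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens; the encoder only sees the kept ones."""
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx], keep_idx


def mean_pool(tokens):
    """Average the token vectors into one global representation z."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    a = alpha_bar[t]
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    x_t = [[math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(rx, re)]
           for rx, re in zip(x0, eps)]
    return x_t, eps


def toy_decoder(x_t, z, w_x, w_z):
    """Hypothetical stand-in decoder: eps_hat = w_x * x_t + w_z * z, elementwise."""
    return [[w_x * x + w_z * zd for x, zd in zip(row, z)] for row in x_t]


def training_loss(x0, t, alpha_bar, mask_ratio, w_x, w_z, rng):
    """Mean squared error between the true noise and the predicted noise."""
    visible, _ = random_mask(x0, mask_ratio, rng)  # upper stream: mask tokens
    z = mean_pool(visible)                         # identity "encoder" + pooling
    x_t, eps = q_sample(x0, t, alpha_bar, rng)     # lower stream: add noise
    eps_hat = toy_decoder(x_t, z, w_x, w_z)        # noise prediction guided by z
    sq = [(p - e) ** 2 for rp, re in zip(eps_hat, eps) for p, e in zip(rp, re)]
    return sum(sq) / len(sq)


if __name__ == "__main__":
    rng = random.Random(0)
    # 20 tokens of dimension 8 stand in for an embedded skeleton sequence.
    x0 = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(20)]
    alpha_bar = linear_alpha_bar(100)
    print(training_loss(x0, 50, alpha_bar, 0.9, 0.0, 0.0, rng))
```

In a real model the encoder and decoder would be learned networks and the loss would be minimized over random timesteps. The sketch only shows how masking restricts what the encoder sees while the decoder still receives the full noisy input.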

Theorems & Definitions (1)

  • Theorem 1 (Bayes Error Rate of Representations)