Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

Jeonghyeok Do; Yun Chen; Geunhyuk Youk; Munchurl Kim

Abstract

The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry: it benefits from efficient masking during pre-training but requires exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first decoder-free masked modeling framework for representation learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this accuracy with exceptional efficiency, reducing inference computational cost by $7.89\times$ compared to existing MAE methods.
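
To make the idea of combining masked modeling with contrastive learning on a shared encoder more concrete, the following is a minimal NumPy sketch of such a combined objective. It is not the authors' implementation. The Figure 2 caption names a feature reconstruction loss $\mathcal{L}_\text{MFM}$ and a contrastive loss $\mathcal{L}_\text{GLCL}$ in a teacher-student setup, but their exact forms are not given here, so the sketch assumes a mean squared error on masked tokens, an InfoNCE loss between local and global embeddings, an exponential-moving-average teacher, and a hypothetical balancing weight `weight`. All function names are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def masked_feature_loss(student_pred, teacher_feat, mask):
    """Assumed stand-in for L_MFM: MSE between student predictions and
    teacher features, averaged over masked tokens only.

    student_pred, teacher_feat: (batch, tokens, dim)
    mask: (batch, tokens) boolean, True = token was masked for the student
    """
    per_token = ((student_pred - teacher_feat) ** 2).mean(axis=-1)
    return float((per_token * mask).sum() / max(mask.sum(), 1))

def info_nce(local_emb, global_emb, temperature=0.1):
    """Assumed stand-in for L_GLCL: each local-view embedding should match
    the global-view embedding of the same sample, with the other samples
    in the batch acting as negatives."""
    z_l = l2_normalize(local_emb)
    z_g = l2_normalize(global_emb)
    logits = z_l @ z_g.T / temperature                    # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def ema_update(teacher_params, student_params, momentum=0.99):
    """Assumption: the teacher tracks the student as a moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def total_loss(student_pred, teacher_feat, mask, local_emb, global_emb, weight=1.0):
    """Sum of the two terms; `weight` is a hypothetical balancing factor."""
    return (masked_feature_loss(student_pred, teacher_feat, mask)
            + weight * info_nce(local_emb, global_emb))
```

Because the regression target in this sketch is a teacher feature rather than raw joint coordinates, no decoder is needed to map features back to input space, which is the property the decoder-free design relies on.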

Paper Structure

This paper contains 25 sections, 7 equations, 3 figures, 8 tables, 4 algorithms.

Introduction
Related Works
Contrastive learning (CL)-based Approaches
Masked auto-encoder (MAE)-based Approaches
Other Pretext Tasks
Method
Overview of SLiM
Decoder-Free Masked Modeling with Contrastive Learning
Semantic Tube Masking (STM)
Skeleton-Aware Augmentations (SAA)
Experimental Results
Datasets
Experiment Details
Performance Comparison
Ablation Studies
...and 10 more sections

Figures (3)

Figure 1: Conceptual comparison of previous MAE methods and our SLiM. (a) Standard MAE methods suffer from a $14.38\times$ computational surge during inference relative to pre-training due to asymmetric full-sequence processing. (b) SLiM synergizes masked modeling with contrastive learning in a decoder-free framework. This symmetric design achieves a $7.89\times$ reduction in inference cost compared to MAE baselines.
Figure 2: Overview of SLiM. Our framework unifies Masked Feature Modeling and Global-Local Contrastive Learning within a decoder-free teacher-student architecture. The student encoder simultaneously minimizes the feature reconstruction error ($\mathcal{L}_\text{MFM}$) on masked patches and the contrastive loss ($\mathcal{L}_\text{GLCL}$) across diverse local views, effectively capturing both fine-grained patterns and global semantics.
Figure 3: Comparison of masking and augmentation strategies. Top: previous masking (a) and augmentations (b-d) resulting in trivial solutions or physically implausible poses. Bottom: our Semantic Tube Masking (e) and Skeletal-Aware Augmentations (f-h) ensuring anatomical and physical consistency through skeleton-aware designs.
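
As a rough illustration of why tube-style masking guards against trivial reconstruction, the sketch below hides whole body parts in every frame, so a hidden joint cannot be recovered by copying it from an adjacent frame or from a neighbouring joint in the same limb. This is a generic part-level tube mask, not the paper's Semantic Tube Masking, whose details are not given in this section. The 25-joint grouping in `BODY_PARTS`, the function name `part_tube_mask`, and the number of masked parts are all assumptions made for the example.

```python
import numpy as np

# Hypothetical grouping of a 25-joint skeleton into body parts. The indices
# are illustrative and do not follow any specific dataset's joint layout.
BODY_PARTS = {
    "torso": [0, 1, 2, 3, 20],
    "left_arm": [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def part_tube_mask(num_frames, num_joints, parts, num_masked_parts, rng):
    """Return a (num_frames, num_joints) boolean mask, True = hidden.

    Randomly chosen body parts are hidden in every frame, so the mask is
    constant along the time axis (a "tube").
    """
    names = list(parts)
    chosen = rng.choice(len(names), size=num_masked_parts, replace=False)
    joint_hidden = np.zeros(num_joints, dtype=bool)
    for i in chosen:
        joint_hidden[parts[names[i]]] = True
    # Repeat the per-joint decision along time to form the tube.
    return np.broadcast_to(joint_hidden, (num_frames, num_joints)).copy()

# Usage: hide two body parts over an 8-frame clip.
rng = np.random.default_rng(0)
mask = part_tube_mask(num_frames=8, num_joints=25, parts=BODY_PARTS,
                      num_masked_parts=2, rng=rng)
visible_tokens = int((~mask).sum())  # tokens an encoder would still see
```

By contrast, a mask drawn independently per frame and per joint would usually leave the same joint visible in a nearby frame, which is the kind of shortcut the caption refers to as a trivial solution.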
