Multi Activity Sequence Alignment via Implicit Clustering

Taein Kwon, Zador Pataki, Mahdi Rad, Marc Pollefeys

TL;DR

MASA addresses the challenge of self-supervised, dense temporal alignment across multiple activities without training separate models per activity. It introduces implicit clip-level clustering coupled with a dual augmentation strategy, integrated into a three-component framework: augmentation, context-aware embeddings, and alignment-cluster learning. The method demonstrates state-of-the-art performance on RGB and 3D skeleton data across H2O, PennAction, and IKEA ASM, excelling in fine-grained tasks and action recognition while showing strong cross-activity generalization. The key contributions are the implicit clip-level clustering approach, the bi-directional matching loss, and the dual augmentation design that yields robust, activity-discriminative representations. These advances offer practical impact for scalable, modality-agnostic video understanding in AR/VR and related domains.

Abstract

Self-supervised temporal sequence alignment can provide rich and effective representations for a wide range of applications. However, the best-performing existing methods are mostly limited to aligning sequences of the same activity and require a separate model to be trained for each activity. We propose a novel framework that overcomes these limitations using sequence alignment via implicit clustering. Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences. This, coupled with our proposed dual augmentation technique, enhances the network's ability to learn generalizable and discriminative representations. Our experiments show that the proposed method outperforms state-of-the-art methods and highlight the generalization capability of our framework across multiple activities and different modalities on three diverse datasets: H2O, PennAction, and IKEA ASM. We will release our code upon acceptance.

Paper Structure

This paper contains 19 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The t-SNE visualization of the learned embeddings on the PennAction dataset. The proposed method effectively distinguishes different activities. On the left, the alignment of two bench pressing sequences is shown, which demonstrates the effectiveness of our approach. On the right, our model not only differentiates between activities like baseball pitch and tennis serve but also recognizes their proximity in the embedding domain. Dashed lines indicate matched frames between sequences.
  • Figure 2: System Overview. Our framework takes RGB images as inputs and obtains RGB features using the frozen pre-trained DINO model [oquab2023dinov2] and skeletons extracted using FrankMocap [rong2021frankmocap]. Dual augmentation generates two different augmented sequences from the frame-wise concatenated image features and skeletons. Note that when we use only one modality, we disable the other modality and feed the remaining one to the context-aware module without concatenation. Our context-aware module extracts embeddings for the downstream tasks. The alignment module matches and clusters the latent features to learn frame-wise video embeddings.
  • Figure 3: t-SNE visualization of the learned embeddings when training on multiple activities on PennAction. (left) CASA's embeddings fall into a single homogeneous space, while (right) ours are more generalizable and discriminative. Each color represents one activity.
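
The input side of the pipeline described in the Figure 2 caption (frame-wise concatenation of the two modalities, followed by dual augmentation into two views) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the feature dimensions, and the specific augmentations (random ordered frame subsampling plus Gaussian jitter) are assumptions made for illustration only; the source does not state which augmentations are used.

```python
import numpy as np


def concat_modalities(rgb_feats, skeletons):
    """Frame-wise concatenation: (T, D_rgb) and (T, D_skel) -> (T, D_rgb + D_skel)."""
    if rgb_feats.shape[0] != skeletons.shape[0]:
        raise ValueError("both modalities need exactly one row per frame")
    return np.concatenate([rgb_feats, skeletons], axis=1)


def augment(seq, rng, keep_ratio=0.8, noise_std=0.01):
    """Hypothetical augmentation (an assumption, not taken from the paper):
    keep a random, temporally ordered subset of frames and add Gaussian jitter."""
    num_frames = seq.shape[0]
    n_keep = max(2, int(round(keep_ratio * num_frames)))
    # Sorting the sampled indices preserves the temporal order of the frames.
    idx = np.sort(rng.choice(num_frames, size=n_keep, replace=False))
    view = seq[idx] + rng.normal(0.0, noise_std, size=(n_keep, seq.shape[1]))
    return view, idx


def dual_augment(seq, seed=0):
    """Produce two independently augmented views of the same sequence."""
    rng = np.random.default_rng(seed)
    return augment(seq, rng), augment(seq, rng)


# Toy stand-ins for per-frame image features and flattened skeleton joints.
rgb = np.random.default_rng(1).normal(size=(20, 8))
skel = np.random.default_rng(2).normal(size=(20, 6))

fused = concat_modalities(rgb, skel)
(view_a, idx_a), (view_b, idx_b) = dual_augment(fused)
print(fused.shape, view_a.shape, view_b.shape)
```

In the single-modality setting mentioned in the caption, the concatenation step would be skipped and `rgb` or `skel` passed to `dual_augment` directly. The returned frame indices are kept because a downstream alignment step needs to know which original frames each view contains.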