Multi Activity Sequence Alignment via Implicit Clustering
Taein Kwon, Zador Pataki, Mahdi Rad, Marc Pollefeys
TL;DR
MASA (Multi Activity Sequence Alignment) addresses self-supervised, dense temporal alignment across multiple activities without training a separate model per activity. It introduces implicit clip-level clustering coupled with a dual augmentation strategy, integrated into a three-component framework: augmentation, context-aware embeddings, and alignment-cluster learning. The method achieves state-of-the-art performance on RGB and 3D skeleton data across H2O, PennAction, and IKEA ASM, performing well on fine-grained tasks and action recognition while showing strong cross-activity generalization. The key contributions are the implicit clip-level clustering approach, the bi-directional matching loss, and the dual augmentation design, which together yield robust, activity-discriminative representations. These advances support scalable, modality-agnostic video understanding in AR/VR and related domains.
Abstract
Self-supervised temporal sequence alignment can provide rich and effective representations for a wide range of applications. However, to achieve optimal performance, existing methods are mostly limited to aligning sequences of the same activity and require a separate model to be trained for each activity. We propose a novel framework that overcomes these limitations using sequence alignment via implicit clustering. Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences. This, coupled with our proposed dual augmentation technique, enhances the network's ability to learn generalizable and discriminative representations. Our experiments show that our proposed method surpasses state-of-the-art results and highlight the generalization capability of our framework across multiple activities and different modalities on three diverse datasets: H2O, PennAction, and IKEA ASM. We will release our code upon acceptance.
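To make "aligning frames in sequences" concrete, the following is a minimal illustrative sketch of generic bidirectional nearest-neighbour matching between two sequences of frame embeddings. It is not the paper's loss or architecture: the function names, the toy 2D embeddings, and the use of cosine similarity with hard argmax matching are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (T1, D) and b (T2, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def bidirectional_nearest_neighbors(emb_a, emb_b):
    """Match frames of sequence A to sequence B and back (hypothetical helper).

    Returns the A->B matches, the B->A matches, and a boolean mask marking
    frames of A that map back to themselves after the round trip.
    """
    sim = cosine_similarity_matrix(emb_a, emb_b)
    a_to_b = sim.argmax(axis=1)  # best frame in B for each frame in A
    b_to_a = sim.argmax(axis=0)  # best frame in A for each frame in B
    consistent = b_to_a[a_to_b] == np.arange(len(emb_a))
    return a_to_b, b_to_a, consistent


# Toy data (assumed): the same motion sampled at two different speeds.
angles_a = np.linspace(0.0, np.pi / 2, 5)  # 5 frames
angles_b = np.linspace(0.0, np.pi / 2, 9)  # 9 frames, twice as dense
emb_a = np.stack([np.cos(angles_a), np.sin(angles_a)], axis=1)
emb_b = np.stack([np.cos(angles_b), np.sin(angles_b)], axis=1)

a_to_b, b_to_a, consistent = bidirectional_nearest_neighbors(emb_a, emb_b)
print(a_to_b)      # each frame of A matched to a frame of B
print(consistent)  # which A frames survive the A -> B -> A round trip
```

In this toy case every frame of the shorter sequence has an exact counterpart in the longer one, so the round trip returns each frame to itself. A learned, self-supervised method would typically replace the hard argmax with a differentiable matching so that the embeddings themselves can be trained.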
