Hierarchical Vector Quantization for Unsupervised Action Segmentation
Federico Spurio, Emad Bahrami, Gianpiero Francesca, Juergen Gall
TL;DR
This work tackles unsupervised temporal action segmentation in long videos by introducing Hierarchical Vector Quantization (HVQ), a two-level vector-quantization framework that learns fine-to-coarse action clusters via codebooks $Z$ and $Q$. The model is trained end-to-end with reconstruction and commitment losses, updates prototypes online, and employs FIFA decoding for temporally smoothed inference, without relying on pseudo-labels. It also introduces a metric based on the Jensen-Shannon Distance (JSD) to quantify segment-length bias, and reports state-of-the-art results on Breakfast, YouTube Instructional, and IKEA ASM, with improved recall and F1 and reduced segment-length bias. The results indicate that hierarchical clustering better captures intra-action variation and yields action representations that are consistent across videos.
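The two-level assignment described above can be illustrated with a minimal NumPy sketch. It covers only the inference-time lookup (frame embedding to nearest fine code in $Z$, then that code to the nearest coarse code in $Q$) and assumes that the second stage quantizes the output of the first. The helper names `nearest_code` and `hierarchical_quantize`, the toy codebooks, and the Euclidean nearest-neighbour rule are illustrative assumptions; training, the losses, online prototype updates, and FIFA decoding are omitted.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return, for every row of x, the index of the nearest codebook row."""
    # Pairwise squared Euclidean distances, shape (num_rows, num_codes).
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def hierarchical_quantize(frames, Z, Q):
    """Fine-to-coarse assignment: frames -> Z (subclusters) -> Q (actions).

    Assumption: the coarse stage quantizes the fine-stage output Z[fine_idx],
    not the raw frame embeddings.
    """
    fine_idx = nearest_code(frames, Z)    # fine subcluster per frame
    z_q = Z[fine_idx]                     # quantized fine embeddings
    coarse_idx = nearest_code(z_q, Q)     # coarse action cluster per frame
    return fine_idx, coarse_idx

# Toy data (hypothetical): four fine codes grouped around two coarse codes.
Z = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Q = np.array([[0.0, 0.5], [10.0, 10.5]])
frames = np.array([[0.1, -0.1], [0.2, 0.9], [9.8, 10.1], [10.1, 11.2]])

fine_idx, coarse_idx = hierarchical_quantize(frames, Z, Q)
print(fine_idx.tolist())    # [0, 1, 2, 3]
print(coarse_idx.tolist())  # [0, 0, 1, 1]
```

In this toy case, frames 0 and 1 fall into different subclusters but the same coarse cluster, which is the sense in which subclusters can absorb variation within one action.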
Abstract
In this work, we address unsupervised temporal action segmentation, which partitions a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. While recent approaches combine representation learning and clustering in a single step for this task, they do not cope with large variations within temporal segments of the same class. To address this limitation, we propose a novel method, termed Hierarchical Vector Quantization (HVQ), that consists of two consecutive vector quantization modules. This results in a hierarchical clustering where the additional subclusters cover the variations within a cluster. We demonstrate that our approach captures the distribution of segment lengths much better than the state of the art. To measure this, we introduce a new metric based on the Jensen-Shannon Distance (JSD) for unsupervised temporal action segmentation. We evaluate our approach on three public datasets, namely Breakfast, YouTube Instructional, and IKEA ASM. Our approach outperforms the state of the art in terms of F1 score, recall, and JSD.
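As a rough illustration of how a Jensen-Shannon Distance can compare segment-length distributions, the sketch below extracts segment lengths from per-frame label sequences, histograms them, and computes the JSD (base 2, so values lie in [0, 1]). The function names `segment_lengths` and `js_distance`, the bin edges, and the toy label sequences are assumptions for illustration; the abstract does not specify the binning or the aggregation over videos, so this should not be read as the exact metric definition.

```python
import numpy as np

def segment_lengths(labels):
    """Lengths of maximal runs of identical labels in a per-frame sequence."""
    labels = np.asarray(labels)
    # Positions where the label changes mark the start of a new segment.
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    bounds = np.concatenate(([0], change, [len(labels)]))
    return np.diff(bounds)

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, log base 2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()       # normalize counts to probabilities
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # terms with a == 0 contribute 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(jsd, 0.0)))  # guard against tiny negative round-off

# Toy example (hypothetical labels): an over-segmented prediction vs. ground truth.
gt = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
pred = [0, 0, 1, 0, 1, 1, 2, 1, 2, 2, 0, 2]
bins = np.arange(1, 6)                    # assumed bin edges for lengths 1..4
h_gt, _ = np.histogram(segment_lengths(gt), bins=bins)
h_pred, _ = np.histogram(segment_lengths(pred), bins=bins)
print(js_distance(h_gt, h_pred))          # larger value = larger length mismatch
```

A value of 0 means the two length histograms are identical, and 1 means they share no bins.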
