2by2: Weakly-Supervised Learning for Global Action Segmentation

Elena Bueno-Benito, Mariella Dimiccoli

TL;DR

2by2 tackles global action segmentation under weak supervision, using only binary labels that indicate whether two videos show the same activity. It introduces a transformer-based Siamese architecture trained with a triadic loss that enforces intra-video action discrimination, inter-video action association, and inter-activity action association, and it adds a context-drop module to handle background frames. Training combines global-, activity-, and video-level objectives and borrows video alignment concepts such as temporal cycle consistency, yielding state-of-the-art results on the Breakfast (BF) and YouTube Instructions (YTI) benchmarks. The approach generalizes across activities and offers a principled framework for learning shared action representations without transcripts.

Abstract

This paper presents a simple yet effective approach for the under-explored task of global action segmentation, which aims to group frames capturing the same action across videos of different activities. Unlike the case in which all videos depict the same activity, the temporal order of actions is not even roughly shared among videos, which makes the task more challenging. We propose to use activity labels to learn, in a weakly-supervised fashion, action representations suitable for global action segmentation. For this purpose, we introduce a triadic learning approach for video pairs that ensures intra-video action discrimination as well as inter-video and inter-activity action association. For the backbone architecture, we use a Siamese network based on sparse transformers that takes video pairs as input and determines whether they belong to the same activity. The proposed approach is validated on two challenging benchmark datasets, Breakfast and YouTube Instructions, where it outperforms state-of-the-art methods.
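
To make the weak supervision signal concrete, the following minimal sketch shows one way binary pair labels could be derived from per-video activity labels. It is an illustration only, not the authors' implementation: the function name `make_pair_labels`, the video identifiers, and the choice to enumerate every unordered pair are assumptions made for this example.

```python
from itertools import combinations


def make_pair_labels(video_activities):
    """Build binary labels for unordered video pairs.

    video_activities: dict mapping a (hypothetical) video id to its
    activity label. Returns a list of (id_a, id_b, label) tuples where
    label is 1 if both videos share the same activity, else 0.
    """
    pairs = []
    # Sort by video id so the pair order is deterministic.
    items = sorted(video_activities.items())
    for (id_a, act_a), (id_b, act_b) in combinations(items, 2):
        pairs.append((id_a, id_b, 1 if act_a == act_b else 0))
    return pairs


# Hypothetical toy data: two "fried egg" videos and one "scrambled egg" video.
videos = {"v1": "fried egg", "v2": "fried egg", "v3": "scrambled egg"}
for id_a, id_b, label in make_pair_labels(videos):
    print(id_a, id_b, label)
```

In this toy example only the pair (v1, v2) receives label 1. A real training pipeline would more likely sample pairs than enumerate all of them, since the number of pairs grows quadratically with the number of videos.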

Paper Structure

This paper contains 29 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Our approach compares video pairs through a Siamese network, using binary labels that indicate whether the two videos belong to the same activity. We propose a triadic loss function that models intra-video discrimination, inter-video associations, and inter-activity associations for clustering actions across videos of different activities.
  • Figure 2: Overview of the proposed 2by2 framework. The figure illustrates our triadic learning approach: intra-video action discrimination, which enhances cross-temporal consistency within a single video (first box); inter-video action associations, which align action frames among similar videos (second box); and inter-activity action associations, which establish global correspondence between different videos (third box). The red arrows indicate steps specific to the training phase.
  • Figure 3: Examples from BF ("scrambled egg" and "fried egg" activities). Comparison of ground truth (GT) segmentation and our 2by2 framework. 2by2 discovers common action steps across activities (see yellow segments) and captures the cyclic nature of the videos (see purple segments).