Dual-Stream Alignment for Action Segmentation

Harshala Gammulle, Clinton Fookes, Sridha Sridharan, Simon Denman

TL;DR

DSA_Net addresses dense action segmentation by dual-stream alignment of frame-wise features and learnable action tokens, fused through a Temporal Context block with quantum-enhanced modulation. The framework introduces a hybrid quantum-classical Q-ActGM pathway and a three-component dual-stream alignment loss (relational consistency, cross-level contrastive, cycle-consistency reconstruction) to distill complementary information across streams. Evaluations on Breakfast, GTEA, 50Salads, and EgoProceL show state-of-the-art performance with consistent improvements in frame-wise accuracy and segmentation metrics, validated by comprehensive ablations of the components and hyperparameters. This work advances action segmentation by integrating quantum-inspired feature modulation into cross-stream fusion, offering a new direction for hybrid quantum-classical video understanding with practical performance gains.

Abstract

Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio-temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProceL. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing
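To make the three-part structure of the Dual-Stream Alignment Loss concrete, here is a minimal pure-Python sketch. The abstract does not give the actual formulas, so every definition below is an assumption chosen for illustration: an InfoNCE-style term stands in for the cross-level contrastive loss, a match between pairwise cosine similarities stands in for relational consistency, and a frame-to-token-to-frame reconstruction error stands in for cycle consistency. The function names, the temperature `tau`, and the equal weights are hypothetical and should not be read as the paper's formulation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v, eps=1e-8):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def contrastive_loss(frames, tokens, labels, tau=0.1):
    # Assumed form: each frame should be closest to the token of its own action.
    total = 0.0
    for f, y in zip(frames, labels):
        probs = softmax([cosine(f, a) / tau for a in tokens])
        total -= math.log(probs[y])
    return total / len(frames)

def relational_loss(frames, tokens, labels):
    # Assumed form: frame-to-frame similarities should match the
    # similarities between the tokens assigned to those frames.
    n = len(frames)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = (cosine(frames[i], frames[j])
                    - cosine(tokens[labels[i]], tokens[labels[j]]))
            total += diff * diff
    return total / (n * n)

def cycle_loss(frames, tokens, tau=0.1):
    # Assumed form: map a frame to a soft mixture of tokens, then compare
    # that mixture with the original frame feature (mean squared error).
    total = 0.0
    for f in frames:
        w = softmax([cosine(f, a) / tau for a in tokens])
        recon = [sum(w[k] * tokens[k][d] for k in range(len(tokens)))
                 for d in range(len(f))]
        total += sum((x - r) ** 2 for x, r in zip(f, recon)) / len(f)
    return total / len(frames)

def alignment_loss(frames, tokens, labels, weights=(1.0, 1.0, 1.0)):
    # Weighted sum of the three illustrative components.
    parts = (relational_loss(frames, tokens, labels),
             contrastive_loss(frames, tokens, labels),
             cycle_loss(frames, tokens))
    return sum(w * p for w, p in zip(weights, parts))

# Toy data: three 2-D frame features, two action tokens.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(alignment_loss(frames, tokens, labels=[0, 0, 1]))
```

In this toy setup, labels that agree with the feature geometry give a lower total than swapped labels, which is the behaviour an alignment objective is meant to reward.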

Paper Structure

This paper contains 24 sections, 23 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: The Dual-Stream Alignment Network (DSA_Net) supports action segmentation by aligning two streams of features, namely frame features and learnable action tokens, via the Dual-Stream Alignment Loss. Feature fusion across streams is facilitated by the Temporal Context (TC) block, which integrates cross-attention with quantum properties through the proposed Quantum-based Action-Guided Modulation (Q-ActGM) layer.
  • Figure 2: Overview of the proposed DSA_Net: The model maintains two streams of features, frame features and action tokens, while modelling their temporal dynamics through Temporal Encoders (TE) and a Global Encoder (GE), respectively. Feature fusion is performed via the Temporal Context (TC) block, within the Temporal Sequence Alignment (TSA) block. The TC block integrates a cross-attention mechanism with the proposed Quantum-based Action-Guided Modulation (Q-ActGM), which introduces quantum properties to enhance expressive power. Feature alignment is encouraged through the proposed Dual-Stream Alignment loss.
  • Figure 3: Visualisation of the action segmentation results on the Breakfast dataset.
  • Figure 4: Visualisation of the action segmentation results on the GTEA dataset.
  • Figure 5: Visualisation of the action segmentation results on the 50Salads dataset.
  • ...and 2 more figures
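
The Figure 2 caption describes fusing frame features with action tokens through cross-attention. The sketch below illustrates only that generic mechanism, under stated assumptions: frame features act as queries, action tokens act as keys and values, the projections are identity maps (no learned weights), and the result is added back to the frame feature as a residual. The Q-ActGM layer is not reproduced here, and the function name `cross_attention` is hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(frames, tokens):
    """Scaled dot-product cross-attention with identity projections.

    frames: list of T frame features (queries), each of length D.
    tokens: list of K action tokens (keys and values), each of length D.
    Returns the fused features (T x D) and the attention weights (T x K).
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for f in frames:
        # Similarity of this frame to every action token.
        scores = [scale * sum(a * b for a, b in zip(f, t)) for t in tokens]
        w = softmax(scores)
        # Attention-weighted mixture of the action tokens.
        context = [sum(w[k] * tokens[k][j] for k in range(len(tokens)))
                   for j in range(d)]
        # Residual connection: keep the frame feature, add token context.
        fused.append([x + c for x, c in zip(f, context)])
        weights.append(w)
    return fused, weights

# Toy data: two 2-D frame features, two action tokens.
frames = [[2.0, 0.0], [0.0, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_attention(frames, tokens)
print(weights)
```

Each row of `weights` sums to one, so every fused frame feature is the original feature plus a convex combination of action tokens.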