Towards Generalizing Temporal Action Segmentation to Unseen Views

Emad Bahrami, Olga Zatsarynna, Gianpiero Francesca, Juergen Gall

TL;DR

The paper tackles temporal action segmentation (TAS) under unseen camera views by defining an evaluation protocol and proposing a Siamese-based framework that learns view-robust representations at both the sequence and action levels. It introduces two losses, $\mathcal{L}_{seq}$ and $\mathcal{L}_{action}$, that align video and action representations across views, enabling generalization to unseen exocentric and egocentric perspectives. Evaluations on Assembly101, IkeaASM, and EgoExoLearn show substantial improvements in F1@50 for unseen views (up to $+54\%$ for egocentric views) while maintaining or improving performance on seen views; ablations and qualitative results support the effectiveness of the approach. The method reduces the need to collect new training data for each camera view and supports robust TAS deployment in real-world, multi-view environments.

Abstract

While there has been substantial progress in temporal action segmentation, the challenge of generalizing to unseen views remains unaddressed. Hence, we define a protocol for unseen view action segmentation where the camera views used for evaluating the model are unavailable during training. This includes changing from top-frontal views to a side view or, even more challenging, from exocentric to egocentric views. Furthermore, we present an approach for temporal action segmentation that tackles this challenge. Our approach leverages a shared representation at both the sequence and segment levels to reduce the impact of view differences during training. We achieve this by introducing a sequence loss and an action loss, which together facilitate consistent video and action representations across different views. The evaluation on the Assembly101, IkeaASM, and EgoExoLearn datasets demonstrates significant improvements, with a 12.8% increase in F1@50 for unseen exocentric views and a substantial 54% improvement for unseen egocentric views.
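
The view constraint of the protocol can be illustrated with a minimal sketch: training recordings come only from seen views, and the unseen-view test set contains only views that never appear in training. The view names follow the Assembly101 split given in the Figure 3 caption; the `(video_id, view)` representation, the helper `split_by_view`, and the toy video IDs are hypothetical and not taken from the paper, which may impose further constraints (for example, separate train and test videos).

```python
from typing import List, Set, Tuple

# A recording is identified here by a hypothetical (video_id, view) pair.
Recording = Tuple[str, str]


def split_by_view(
    recordings: List[Recording],
    seen_views: Set[str],
    unseen_views: Set[str],
) -> Tuple[List[Recording], List[Recording]]:
    """Return (train, test_unseen) so that no test view occurs in training."""
    if seen_views & unseen_views:
        raise ValueError("seen and unseen view sets must be disjoint")
    train = [r for r in recordings if r[1] in seen_views]
    test_unseen = [r for r in recordings if r[1] in unseen_views]
    return train, test_unseen


# View sets mirroring the Assembly101 split: six seen exocentric views,
# two unseen exocentric views, and four unseen egocentric views.
seen = {f"exo_{i}" for i in range(1, 7)}
unseen = {"exo_7", "exo_8"} | {f"ego_{i}" for i in range(1, 5)}

# Toy data (assumption): two videos, each recorded from every view.
recordings = [
    (video_id, view)
    for video_id in ("video_a", "video_b")
    for view in sorted(seen | unseen)
]
train, test_unseen = split_by_view(recordings, seen, unseen)
```

The disjointness check is the essential part: a model evaluated on `test_unseen` has never observed those camera positions during training.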

Paper Structure

This paper contains 15 sections, 5 equations, 19 figures, 7 tables.

Figures (19)

  • Figure 1: In this work, we address the problem of temporal action segmentation (TAS) on unseen views. During training, we observe at least two different views of long video sequences that have been annotated frame-wise with the occurring actions. In addition to using a standard loss for temporal action segmentation $\mathcal{L}_{TAS}$, we propose a sequence loss $\mathcal{L}_{seq}$ and an action loss $\mathcal{L}_{action}$ that increase the generalization to unseen views without reducing the accuracy on seen views. The model can then be deployed in a setting with unseen views that are very different from the views in the training set.
  • Figure 2: To train a network for temporal action segmentation that can be applied to views that are not part of the training data, we propose a sequence loss (top) and an action loss (bottom). The sequence loss takes two different camera views as input and computes the frame-wise similarity between the views (cosine similarity). In addition, a standard loss for temporal action segmentation (TAS) is applied to both views. The action loss takes randomly chosen action segments of the same action (green) as input and computes the frame-wise similarity between the action segments. The weights of $\mathcal{E}_{\theta}$ are shared between the two branches.
  • Figure 3: Camera views of the Assembly101 dataset. We use $\mathcal{V}_{seen} = \{Exo_{1}, \dots, Exo_{6}\}$ as seen views. As unseen views, we use $\mathcal{V}^{exo}_{unseen} = \{ Exo_{7}, Exo_{8} \}$ and $\mathcal{V}^{ego}_{unseen} = \{ Ego_{1}, Ego_{2}, Ego_{3}, Ego_{4} \}$.
  • Figure 4: The F1@50 score shows the impact of adding the Sequence Loss ($\mathcal{L}_{seq}$) and Action Loss ($\mathcal{L}_{action}$) to the Baseline for MViTv2 features. Sequence Loss denotes the addition of $\mathcal{L}_{seq}$ to the Baseline, while Action Loss denotes adding $\mathcal{L}_{action}$ to the Baseline.
  • Figure 5: The plots show the influence of weighting between $\mathcal{L}_{seq}$ and $\mathcal{L}_{action}$, where $\lambda$ represents the weight of $\mathcal{L}_{seq}$ and $\beta$ represents the weight of $\mathcal{L}_{action}$. In the left plots, $\lambda$ is fixed at $0.5$ while $\beta$ varies. In the right plots, $\beta$ is set at $0.2$ while $\lambda$ varies.
  • ...and 14 more figures
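
The captions of Figures 2 and 5 describe a frame-wise cosine similarity between two views and weights $\lambda$ and $\beta$ for the two proposed losses. A minimal sketch of how such a loss could be computed is given below. It rests on assumptions not stated in the captions: the loss is taken as the mean of $1 - \cos$ over time-aligned frames, the three losses are combined additively, and the defaults $\lambda = 0.5$ and $\beta = 0.2$ are simply the fixed values from the Figure 5 caption. The function names are hypothetical, and how action segments of different lengths are aligned is not covered.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def framewise_similarity_loss(
    feats_a: Sequence[Sequence[float]],
    feats_b: Sequence[Sequence[float]],
) -> float:
    """Mean of (1 - cosine similarity) over time-aligned frames.

    Assumed form: the captions only state that a frame-wise cosine
    similarity between the two inputs is computed.
    """
    if len(feats_a) != len(feats_b) or not feats_a:
        raise ValueError("inputs must be non-empty and have the same number of frames")
    per_frame = [1.0 - cosine_similarity(a, b) for a, b in zip(feats_a, feats_b)]
    return sum(per_frame) / len(per_frame)


def total_loss(l_tas: float, l_seq: float, l_action: float,
               lam: float = 0.5, beta: float = 0.2) -> float:
    """Assumed additive combination: L_TAS + lambda * L_seq + beta * L_action."""
    return l_tas + lam * l_seq + beta * l_action


# Toy features for the same two frames seen from two views (illustrative only).
view_1 = [[1.0, 0.0], [0.0, 1.0]]
view_2 = [[1.0, 0.0], [1.0, 0.0]]
l_seq = framewise_similarity_loss(view_1, view_2)
```

In this toy example the first frame matches across views and the second is orthogonal, so the per-frame terms are 0 and 1. In a Siamese setup the two feature sequences would come from the same encoder $\mathcal{E}_{\theta}$ applied to each branch.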