Towards Generalizing Temporal Action Segmentation to Unseen Views
Emad Bahrami, Olga Zatsarynna, Gianpiero Francesca, Juergen Gall
TL;DR
The paper tackles temporal action segmentation (TAS) under unseen camera views by defining an evaluation protocol and proposing a Siamese-based framework that learns view-robust representations at both the sequence and action levels. It introduces two losses, $\mathcal{L}_{seq}$ and $\mathcal{L}_{action}$, that align video and action representations across views, enabling generalization to unseen exocentric and egocentric perspectives. Evaluations on Assembly101, IkeaASM, and EgoExoLearn show substantial improvements in F1@50 for unseen views ($+12.8\%$ for exocentric and $+54\%$ for egocentric views) while maintaining or improving performance on seen views; ablations and qualitative results support the effectiveness of the approach. The method reduces the need to collect new training data for each camera view, which supports more robust TAS deployment in real-world, multi-view environments.
Abstract
While there has been substantial progress in temporal action segmentation, the challenge of generalizing to unseen views remains unaddressed. Hence, we define a protocol for unseen-view action segmentation in which the camera views used for evaluating the model are unavailable during training. This includes changing from top-frontal views to a side view or, even more challenging, from exocentric to egocentric views. Furthermore, we present an approach for temporal action segmentation that tackles this challenge. Our approach leverages a shared representation at both the sequence and segment levels to reduce the impact of view differences during training. We achieve this by introducing a sequence loss and an action loss, which together facilitate consistent video and action representations across different views. The evaluation on the Assembly101, IkeaASM, and EgoExoLearn datasets demonstrates significant improvements, with a 12.8% increase in F1@50 for unseen exocentric views and a substantial 54% improvement for unseen egocentric views.
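
For readers unfamiliar with the reported metric, the following is a minimal sketch of how segmental F1@50 is commonly computed in the temporal action segmentation literature: a predicted segment counts as a true positive if it has the same label as a ground-truth segment, overlaps it with an IoU of at least 0.5, and that ground-truth segment has not already been matched. This is an illustration under those assumptions, not the authors' evaluation code; the function names `get_segments` and `f1_at_k` and the toy label sequences are hypothetical.

```python
def get_segments(labels):
    """Collapse a frame-wise label sequence into (label, start, end) segments.

    `end` is exclusive, so a segment covers frames start .. end - 1.
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def f1_at_k(pred, gt, overlap=0.5):
    """Segmental F1 at a given IoU threshold (overlap=0.5 gives F1@50)."""
    pred_segs = get_segments(pred)
    gt_segs = get_segments(gt)
    matched = [False] * len(gt_segs)
    tp = 0
    fp = 0

    for label, p_start, p_end in pred_segs:
        best_iou, best_idx = 0.0, -1
        # Find the same-label ground-truth segment with the highest IoU.
        for idx, (g_label, g_start, g_end) in enumerate(gt_segs):
            if g_label != label:
                continue
            inter = max(0, min(p_end, g_end) - max(p_start, g_start))
            union = max(p_end, g_end) - min(p_start, g_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        # Each ground-truth segment can be matched at most once.
        if best_idx >= 0 and best_iou >= overlap and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1

    fn = len(gt_segs) - sum(matched)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: frame-wise labels for a 10-frame video.
gt_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
pred_labels = [0, 0, 0, 1, 1, 1, 1, 1, 2, 0]
score = f1_at_k(pred_labels, gt_labels, overlap=0.5)
```

In this toy example the spurious one-frame segment at the end of the prediction counts as a false positive even though most frames are labeled correctly. Because the metric penalizes over-segmentation in this way, it complements frame-wise accuracy.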
