Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

Camillo Quattrocchi, Antonino Furnari, Daniele Di Mauro, Mario Valerio Giuffrida, Giovanni Maria Farinella

TL;DR

The paper addresses the challenge of transferring temporal action segmentation (TAS) from exocentric to egocentric video without labeling egocentric data. It introduces a knowledge distillation framework that uses synchronized, unlabeled exocentric-egocentric video pairs, with distillation applied at both the feature extractor and the TAS model levels. Evaluations on Assembly101 and EgoExo4D show substantial improvements over unsupervised domain adaptation baselines and, in some cases, performance on par with models trained on labeled egocentric data. The approach is robust to limited synchronized data and to synchronization jitter, and the authors release open-source code.

Abstract

We consider the problem of transferring a temporal action segmentation system initially designed for exocentric (fixed) cameras to an egocentric scenario, where wearable cameras capture video data. The conventional supervised approach requires the collection and labeling of a new set of egocentric videos to adapt the model, which is costly and time-consuming. Instead, we propose a novel methodology which performs the adaptation leveraging existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, for which temporal action segmentation annotations do not need to be collected. We implement the proposed methodology with an approach based on knowledge distillation, which we investigate both at the feature and Temporal Action Segmentation model level. Experiments on Assembly101 and EgoExo4D demonstrate the effectiveness of the proposed method against classic unsupervised domain adaptation and temporal alignment approaches. Without bells and whistles, our best model performs on par with supervised approaches trained on labeled egocentric data, without ever seeing a single egocentric label, achieving a +15.99 improvement in the edit score (28.59 vs 12.60) on the Assembly101 dataset compared to a baseline model trained solely on exocentric data. In similar settings, our method also improves edit score by +3.32 on the challenging EgoExo4D benchmark. Code is available here: https://github.com/fpv-iplab/synchronization-is-all-you-need.
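
The feature-level distillation idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear student, the mean-squared-error objective, the synthetic features, and every name used here (`distill_features`, `ego_inputs`, `exo_targets`) are assumptions made only for illustration. The point it shows is that synchronized pairs provide a per-frame target, so the egocentric student can be trained to match the frozen exocentric teacher without any action labels.

```python
import numpy as np

def distill_features(ego_inputs, exo_targets, lr=0.5, steps=200):
    """Fit a linear ego 'student' whose outputs match frozen exo 'teacher' features.

    ego_inputs:  (T, d_in) per-frame egocentric inputs (hypothetical).
    exo_targets: (T, d_out) teacher features of the synchronized exo frames.
    No action labels are used: the only supervision is the synchronized pairing.
    """
    n, d_in = ego_inputs.shape
    d_out = exo_targets.shape[1]
    W = np.zeros((d_in, d_out))  # student parameters
    losses = []
    for _ in range(steps):
        diff = ego_inputs @ W - exo_targets        # student minus teacher, per frame
        losses.append(float(np.mean(diff ** 2)))   # MSE distillation loss (assumed)
        grad = 2.0 * ego_inputs.T @ diff / (n * d_out)
        W -= lr * grad                             # plain gradient descent step
    return W, losses

# Synthetic stand-in for one synchronized exo-ego video pair of T frames.
rng = np.random.default_rng(0)
T = 200
ego_inputs = rng.normal(size=(T, 8))
exo_targets = ego_inputs @ rng.normal(size=(8, 4))  # toy teacher features

W, losses = distill_features(ego_inputs, exo_targets)
print(f"distillation loss: {losses[0]:.3f} -> {losses[-1]:.6f}")
```

In a realistic setting the student would be a video backbone rather than a linear map, and a second distillation stage would align the outputs of the TAS model itself; both are beyond this sketch.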

Paper Structure (25 sections, 4 equations, 9 figures, 9 tables)


Figures (9)

  • Figure 1: We consider the problem of transferring a temporal action segmentation model from an exocentric to an egocentric setup. (1) We assume that a set of labeled exocentric videos is available to train an exocentric temporal action segmentation model. (2) When tested on egocentric data, the exocentric model exhibits poor performance due to domain shift. (3) We propose to use an exocentric-to-egocentric adaptation process to adapt the exocentric model using a set of synchronized unlabeled exocentric-egocentric video pairs. (4) At test time, the model is able to operate on egocentric data, with no access to exocentric videos.
  • Figure 1: Features Distillation processes.
  • Figure 2: The proposed method to adapt an exocentric Temporal Action Segmentation model trained on $\mathcal{D}_{train}^{exo}$ to an egocentric setting using a set of unlabeled synchronized video pairs $\mathcal{D}_{adapt}^{pair}$. We investigate distillation at two different levels: the feature extractor and the model. The bottom branch (exocentric) is used to supervise the top branch (egocentric).
  • Figure 2: TAS Model distillation process.
  • Figure 3: Three pairs of synchronized exo-ego (v3, e4) frames from Assembly101.
  • ...and 4 more figures