Personalized Federated Learning for Egocentric Video Gaze Estimation with Comprehensive Parameter Freezing

Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

TL;DR

Problem addressed: private, personalized gaze estimation from egocentric video across users. Approach: FedCPF, a personalized federated learning framework that comprehensively freezes the parameters with the largest average rate of change $\Delta_{v_i^r}^{\mathrm{avg}}$ across iterations, within a transformer-based Global-Local Correlation backbone. Key results: FedCPF surpasses FedAvg, FedProx, FedPAC, and FedSelect on EGTEA Gaze+ and Ego4D, with significant gains in recall and precision, and ablation evidence that the average-rate criterion is beneficial. Significance: demonstrates effective privacy-preserving personalization for egocentric gaze tasks and offers a practical approach for AR/VR and assistive technologies without centralizing data.

Abstract

Egocentric video gaze estimation requires models to capture individual gaze patterns while adapting to diverse user data. Our approach, FedCPF, integrates a transformer-based architecture into a personalized federated learning (PFL) framework in which only the most significant parameters, those exhibiting the highest rate of change during training, are selected and frozen for personalization in client models. Through extensive experimentation on the EGTEA Gaze+ and Ego4D datasets, we demonstrate that FedCPF significantly outperforms previously reported federated learning methods, achieving superior recall, precision, and F1-score. These results confirm the effectiveness of our comprehensive parameter freezing strategy in enhancing model personalization, making FedCPF a promising approach for tasks requiring both adaptability and accuracy in federated learning settings.
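The selection criterion described above can be illustrated with a minimal sketch. It is not the authors' implementation: this summary does not give the exact definition of the average rate of change, so the relative-change formula, the `fraction` hyperparameter, and the function names `average_rate_of_change` and `select_personalized_mask` are all assumptions made for illustration.

```python
import numpy as np

def average_rate_of_change(snapshots, eps=1e-8):
    """Mean per-parameter relative change between consecutive snapshots.

    snapshots has shape (T, P): the flattened parameter vector recorded
    after each of T local iterations. The relative-change form used here
    is an assumption, not the paper's definition.
    """
    snapshots = np.asarray(snapshots, dtype=float)
    diffs = np.abs(np.diff(snapshots, axis=0))         # shape (T-1, P)
    rates = diffs / (np.abs(snapshots[:-1]) + eps)     # relative change per step
    return rates.mean(axis=0)                          # average over iterations

def select_personalized_mask(snapshots, fraction=0.25):
    """Boolean mask marking the top `fraction` of parameters by average rate.

    True entries are the ones that would be kept local (personalized);
    `fraction` is a hypothetical hyperparameter.
    """
    avg = average_rate_of_change(snapshots)
    k = max(1, int(round(fraction * avg.size)))
    top = np.argsort(avg)[-k:]                         # indices of the k largest rates
    mask = np.zeros(avg.size, dtype=bool)
    mask[top] = True
    return mask

# Three snapshots of a four-parameter model; only the second entry changes a lot.
snaps = [[1.0, 1.0, 1.0, 1.0],
         [1.0, 2.0, 1.0, 1.0],
         [1.0, 4.0, 1.1, 1.0]]
mask = select_personalized_mask(snaps, fraction=0.25)
```

In this toy example the second parameter doubles at every step, so it is the single entry marked as personalized when `fraction=0.25`.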

Paper Structure

This paper contains 10 sections, 5 equations, 3 figures, 3 tables, 2 algorithms.

Figures (3)

  • Figure 1: An overview of the visual token embedding process in the Global-Local Correlation (GLC) module. Each input video frame is divided into patches, processed by convolution, and embedded into the global token ($\mathbf{x}_1$) and the local tokens ($\mathbf{x}_2$ to $\mathbf{x}_{\phi+1}$) for subsequent processing.
  • Figure 2: An overview of the FedCPF algorithm. Input video sequences are embedded into tokens and processed by client models that use a transformer with a Global-Local Correlation mechanism. Personalized parameters are frozen locally, while global parameters are aggregated to update the global model. $u_i^r$ and $v_i^r$ denote the personalized and global parameters, and $u_i^{r+}$ and $v_i^{r+}$ denote the updated personalized and global parameters after MaskUpgrade, respectively. $\theta_g^r$ represents the current global model.
  • Figure 3: Qualitative evaluation results for gaze estimation. The predicted gaze is shown as a heatmap overlaid on the input frames. In the Ground Truth image, the red dots mark the gaze fixations recorded in the dataset.
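The split in the Figure 2 caption, where personalized parameters stay on each client and global parameters are aggregated, can be sketched as follows. This is a simplified illustration under stated assumptions, not the FedCPF procedure itself: it uses an unweighted mean over the clients that share each coordinate, keeps the previous global value where no client shares a coordinate, and the names `aggregate_global` and `sync_client` are hypothetical.

```python
import numpy as np

def aggregate_global(client_params, personal_masks, prev_global):
    """Average each coordinate over the clients that treat it as global.

    client_params:  (C, P) array of flattened client parameters.
    personal_masks: (C, P) boolean array, True where a client keeps the
                    coordinate personalized (excluded from aggregation).
    prev_global:    (P,) previous global model, used as a fallback
                    (an assumption) when no client shares a coordinate.
    """
    params = np.asarray(client_params, dtype=float)
    shared = ~np.asarray(personal_masks, dtype=bool)
    counts = shared.sum(axis=0)                  # contributing clients per coordinate
    sums = (params * shared).sum(axis=0)         # sum over contributing clients only
    new_global = np.array(prev_global, dtype=float)
    has = counts > 0
    new_global[has] = sums[has] / counts[has]
    return new_global

def sync_client(local_params, personal_mask, global_params):
    """Keep personalized entries local; take all other entries from the global model."""
    return np.where(personal_mask, local_params, global_params)

# Two clients, three parameters; both personalize index 1, the second also index 2.
clients = np.array([[1.0, 10.0, 5.0],
                    [3.0, 20.0, 7.0]])
masks = np.array([[False, True, False],
                  [False, True, True]])
new_global = aggregate_global(clients, masks, prev_global=np.zeros(3))
client_2 = sync_client(clients[1], masks[1], new_global)
```

Here the first coordinate is averaged over both clients, the second is left at its previous global value because every client personalizes it, and the third comes from the first client alone.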