Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning

Donghwa Kang, Junho Kim, Dongwoo Kang

Abstract

Event cameras offer unique advantages for facial keypoint alignment under challenging conditions, such as low light and rapid motion, due to their high temporal resolution and robustness to varying illumination. However, existing RGB facial keypoint alignment methods do not perform well on event data, and training solely on event data often leads to suboptimal performance because of its limited spatial information. Moreover, the lack of comprehensive labeled event datasets further hinders progress in this area. To address these issues, we propose a novel framework based on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER) for event-based facial keypoint alignment. Our framework employs CMFA to integrate corresponding RGB data, guiding the model to extract robust facial features from event input images. In parallel, SSMER enables effective feature learning from unlabeled event data, overcoming spatial limitations. Extensive experiments on our real-event E-SIE dataset and a synthetic-event version of the public WFLW-V benchmark show that our approach consistently surpasses state-of-the-art methods across multiple evaluation metrics.

Paper Structure

This paper contains 26 sections, 8 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Our framework combines the rich spatial detail of RGB frames with the robust motion cues of event data. We train the event backbone through self-supervised multi-event representation learning and integrate both modalities with cross-modal fusion attention in a transformer-based pipeline.
  • Figure 2: An overview of our event-based facial keypoint alignment pipeline. The synchronized RGB and event streams each pass through feature extraction: the RGB branch employs a pretrained backbone, while the event branch uses our SSMER backbone. The extracted features, together with their respective structure encodings, are then fused by CMFA, followed by MSA and MCA to refine landmark-specific patches. A shared alignment head subsequently predicts the facial landmarks.
  • Figure 3: Detailed layout of our facial keypoint alignment module, illustrating how CMFA works in tandem with MSA and MCA. CMFA takes its query from the input embeddings and its key and value from the RGB patch embeddings with structure encoding, and fuses them through attention. The resulting refined features then flow through MSA and MCA, where event patch embeddings with structure encoding further guide the final keypoint alignment.
  • Figure 4: Illustration of our SSMER pipeline. (a) Three representation pairs are processed via contrastive learning, and the resulting losses are summed into a multi-representation loss. (b) A detailed view of our contrastive learning procedure for each pair.
  • Figure 5: Construction pipeline of the synthetic E-CelebV-HQ dataset.
  • ...and 5 more figures
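
The captions for Figures 3 and 4 describe two mechanisms that a small sketch may make more concrete: attention whose query comes from one set of embeddings while the key and value come from RGB patch embeddings, and a multi-representation loss formed by summing per-pair contrastive losses. The code below is only an illustration under stated assumptions, not the authors' implementation. It uses a single attention head with no learned projections, and it assumes an InfoNCE-style contrastive loss, which the captions do not specify. All function names (`cross_attention`, `info_nce`, `multi_representation_loss`) and the temperature value are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dim vectors (stand-in for the input embeddings).
    keys, values: lists of vectors (stand-in for RGB patch embeddings).
    Learned Q/K/V projections are omitted (assumption, for brevity).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        # Each output is the attention-weighted sum of the value vectors.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (assumed): anchors[i] should match positives[i]."""
    def normalize(v):
        norm = math.sqrt(dot(v, v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, a_i in enumerate(a):
        probs = softmax([dot(a_i, p_j) / temperature for p_j in p])
        loss -= math.log(probs[i])
    return loss / len(a)


def multi_representation_loss(pairs, temperature=0.1):
    """Sum the contrastive loss over every (anchors, positives) pair."""
    return sum(info_nce(x, y, temperature) for x, y in pairs)
```

In this sketch, calling `cross_attention` with one set of embeddings as `queries` and a second set as both `keys` and `values` yields one fused vector per query, and passing three `(anchors, positives)` pairs to `multi_representation_loss` mirrors the three representation pairs mentioned in the Figure 4 caption.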