PCIE_LAM Solution for Ego4D Looking At Me Challenge
Kanokphan Lertniphonphan, Jun Xie, Yaqing Meng, Shijing Wang, Feng Chen, Zhepeng Wang
TL;DR
This work tackles the Ego4D Looking At Me (LAM) challenge: predicting whether a person in the scene is looking at the camera wearer in egocentric video. It introduces InternLSTM, which combines a frozen InternVL image encoder for spatial features with a Bi-LSTM for temporal dynamics, followed by a median gaze-smoothing post-processing step that mitigates motion-blur artifacts. With data augmentation, test-time augmentation, and ensemble fusion with baseline gaze models, the approach takes 1st place on the Ego4D LAM leaderboard, with an overall mAP of 0.81 and accuracy of 0.93 on the test set. The results indicate robustness to motion blur and show the value of combining strong spatial representations with temporal modeling and post-processing in gaze-directed egocentric video analysis.
Abstract
This report presents our team's 'PCIE_LAM' solution for the Ego4D Looking At Me Challenge at CVPR 2024. The goal of the challenge is to determine whether a person in the scene is looking at the camera wearer, given a video in which the faces of social partners have been localized. Our proposed solution, InternLSTM, consists of an InternVL image encoder and a Bi-LSTM network. InternVL extracts spatial features, while the Bi-LSTM extracts temporal features. The task is highly challenging because of the distance between the person in the scene and the camera and because of camera movement, both of which cause significant blurring in the face image. To handle the resulting unstable predictions, we implemented a Gaze Smoothing filter that removes noise and spikes from the output. Our approach achieved 1st place in the Looking At Me challenge with 0.81 mAP and 0.93 accuracy. Code is available at https://github.com/KanokphanL/Ego4D_LAM_InternLSTM
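To make the idea of the Gaze Smoothing filter concrete, the following is a minimal sketch of a sliding-window median filter applied to per-frame looking-at-me scores for one tracked face. It is an illustration under stated assumptions, not the implementation in the linked repository: the function name `median_smooth`, the default window size, and the choice to shrink the window at the sequence boundaries are all hypothetical.

```python
from statistics import median


def median_smooth(scores, window=5):
    """Smooth a sequence of per-frame scores with a sliding median.

    Assumptions (not taken from the report): `scores` holds one score per
    frame for a single face track, `window` is an odd number of frames, and
    the window is truncated at the start and end of the sequence.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        # Clamp the window to the valid index range at the boundaries.
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(median(scores[lo:hi]))
    return smoothed


# A single blurred frame produces an isolated low score in an otherwise
# confident track; the median over neighboring frames suppresses it.
track_scores = [0.9, 0.9, 0.1, 0.9, 0.9]
smoothed_scores = median_smooth(track_scores, window=3)
```

A median is used here rather than a mean because an isolated outlier does not shift the median of its neighborhood, whereas it would pull a moving average toward the spike.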
