Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition
Masashi Hatano, Ryo Hachiuma, Ryo Fujii, Hideo Saito
TL;DR
The paper tackles cross-domain few-shot learning (CD-FSL) for egocentric action recognition with unlabeled target data, focusing on bridging large domain gaps and reducing inference cost. It introduces MM-CDFSL, which combines domain-adapted, class-discriminative pretraining with multimodal distillation into an RGB backbone and an ensemble masked inference strategy that cuts computation. The key contributions are per-modality VideoMAE-based pretraining, multimodal distillation to RGB, and ensemble inference driven by Tube Masking. The method gains about 6.1 points in both 1-shot and 5-shot accuracy and runs 2.2x faster than prior CD-FSL methods across Ego4D-to-target benchmarks. It also reduces GFLOPs by 46% and memory by 34% while matching or surpassing state-of-the-art accuracy, which makes deployment on resource-constrained setups more practical. These results demonstrate the value of multimodal information and masked ensemble techniques for robust cross-domain egocentric action recognition.
Abstract
We address a novel cross-domain few-shot learning (CD-FSL) task with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges associated with egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (e.g., daily life vs. industrial domain) and (2) the computational cost for real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and reduce inference cost. To address the first challenge, we propose incorporating multimodal distillation into the student RGB model using teacher models. Each teacher model is trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model's adaptability to the target domain. To address the second challenge, we introduce ensemble masked inference, a technique that reduces the number of input tokens through masking; ensemble prediction mitigates the performance degradation caused by masking. Our approach outperforms state-of-the-art CD-FSL approaches by a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points for 1-shot/5-shot settings while achieving $2.2$ times faster inference. Project page: https://masashi-hatano.github.io/MM-CDFSL/
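The idea behind ensemble masked inference can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mask ratio of 0.5, the use of two masked views, and the averaging of softmax probabilities are all assumptions made for illustration, and `classify` stands in for any model that maps a variable-length token sequence to class logits. The sketch only shows the general pattern of dropping tokens with a tube mask (the same spatial patches in every frame), running the model on each reduced input, and averaging the predictions.

```python
import numpy as np

def tube_keep_indices(num_frames, num_patches, mask_ratio, rng):
    """Choose spatial patches to keep and reuse them in every frame (a 'tube')."""
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    kept = np.sort(rng.choice(num_patches, size=num_keep, replace=False))
    # Tokens are assumed frame-major: index = frame * num_patches + patch.
    offsets = np.arange(num_frames)[:, None] * num_patches
    return (offsets + kept[None, :]).reshape(-1)

def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def ensemble_masked_predict(tokens, classify, num_frames, num_patches,
                            mask_ratio=0.5, num_views=2, seed=0):
    """Average class probabilities over several independently masked views.

    mask_ratio and num_views are illustrative defaults, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(num_views):
        keep = tube_keep_indices(num_frames, num_patches, mask_ratio, rng)
        probs.append(softmax(classify(tokens[keep])))  # model sees fewer tokens
    return np.mean(probs, axis=0)

# Toy usage: 4 frames x 10 patches, 8-dim tokens, 3 classes.
tokens = np.random.default_rng(1).normal(size=(4 * 10, 8))
weights = np.random.default_rng(2).normal(size=(8, 3))
toy_classifier = lambda x: x.mean(axis=0) @ weights  # hypothetical stand-in model
prediction = ensemble_masked_predict(tokens, toy_classifier, 4, 10)
```

Each forward pass here processes only half of the tokens, which is where the computational saving comes from in this sketch; averaging over views is one simple way to offset the information lost to masking.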
