Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition

Masashi Hatano, Ryo Hachiuma, Ryo Fujii, Hideo Saito

TL;DR

The paper tackles cross-domain few-shot learning for egocentric action recognition with unlabeled target data, focusing on bridging large domain gaps and reducing inference cost. It introduces MM-CDFSL, which combines domain-adapted, class-discriminative pretraining with multimodal distillation into an RGB backbone and an ensemble masked inference strategy that cuts computation. Key contributions include per-modality VideoMAE-based pretraining, multimodal distillation to RGB, and Tube Masking-driven ensemble inference, which together yield gains of about 6.1 points in both 1-shot and 5-shot accuracy and a 2.2x inference speed-up over prior CD-FSL methods on Ego4D-to-target benchmarks. The approach also reduces GFLOPs by 46% and memory use by 34% while maintaining or surpassing state-of-the-art accuracy, which makes deployment on resource-constrained setups more practical. These results demonstrate the value of multimodal information and masked ensemble techniques for robust cross-domain egocentric action recognition.

Abstract

We address a novel cross-domain few-shot learning (CD-FSL) task with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges associated with egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (e.g., daily life vs. industrial domain) and (2) the computational cost for real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and reduce inference cost. To address the first challenge, we propose incorporating multimodal distillation into the student RGB model using teacher models. Each teacher model is trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model's adaptability to the target domain. To address the second challenge, we introduce ensemble masked inference, a technique that reduces the number of input tokens through masking; ensemble prediction mitigates the performance degradation caused by masking. Our approach outperformed the state-of-the-art CD-FSL approaches by a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points for 1-shot/5-shot settings while achieving $2.2$ times faster inference speed. Project page: https://masashi-hatano.github.io/MM-CDFSL/
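
The multimodal distillation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss, so the mean-squared-error feature matching below is an assumption, as are the function names, the modality keys, and the toy feature values; comparing a single student feature vector against every teacher is also a simplification.

```python
from typing import Dict, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_feat: Vector, teacher_feats: Dict[str, Vector]) -> float:
    """Average the student-to-teacher feature distance over all modalities.

    Assumption: MSE feature matching; the loss used by the paper may differ.
    """
    if not teacher_feats:
        raise ValueError("at least one teacher is required")
    losses = [mse(student_feat, feat) for feat in teacher_feats.values()]
    return sum(losses) / len(losses)


# Hypothetical features for one unlabeled target clip (no label is needed).
student = [0.2, 0.4, 0.1]
teachers = {
    "rgb": [0.2, 0.4, 0.1],
    "flow": [0.5, 0.4, 0.1],
    "pose": [0.2, 0.1, 0.1],
}
loss = distillation_loss(student, teachers)
```

Because the loss depends only on teacher and student features, it can be computed on target clips without labels, which is consistent with the abstract's statement that only unlabeled target data is used during distillation.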

Paper Structure (17 sections, 7 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Problem setup. During the meta-training stage, a model is trained on source data and unlabeled target data from multiple modalities. In the meta-testing stage, a few examples of novel classes from the support set are provided to learn a classifier. The network then predicts the categories of samples from the query set, which belong to the same classes as the support set. Unlike existing setups, we leverage multimodal data (e.g., optical flow or hand poses) during the meta-training stage. During the meta-testing stage, only RGB videos are used as inputs.
  • Figure 2: The framework of our proposed method. Our approach has two meta-training stages and two meta-testing stages: (1) learning domain-adapted and class-discriminative features for all modalities, (2) distilling the multimodal features into student RGB encoders, (3) few-shot learning to adapt to novel classes, and (4) ensemble masked inference using $P$ Tube Masking patterns.
  • Figure 3: Accuracy vs. inference time. Trade-off between action recognition accuracy and inference speed for the existing method and our proposed approach, across various masking ratios $\rho_{\text{infer}}$ and ensemble numbers $P$. The number next to each point for our proposed method denotes the ensemble number.
  • Figure 4: Samples from each dataset. A curated selection of RGB images from each dataset showcases the domain gap between the source and target datasets.
  • Figure 5: Comparative UMAP visualization of feature representations. UMAP plot of 10 classes from EPIC-Kitchens validation set with features obtained from (a) Only RGB (only reconstruction), (b) Only RGB, and (c) Ours.
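
The ensemble masked inference stage referenced in Figures 2 and 3 can be sketched roughly as follows. Tube masking keeps the same spatial token positions in every frame; each of the $P$ forward passes sees only a fraction $1 - \rho_{\text{infer}}$ of the tokens, and the predictions are then combined. Averaging softmax probabilities, the flat token layout, and the `toy_model` stand-in are assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from typing import Callable, List, Sequence


def tube_mask(num_spatial: int, num_frames: int, mask_ratio: float,
              rng: random.Random) -> List[int]:
    """Return visible token indices; the same spatial positions are kept in every frame."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_keep = max(1, round(num_spatial * (1.0 - mask_ratio)))
    kept_positions = sorted(rng.sample(range(num_spatial), num_keep))
    # Assumed layout: tokens are stored frame by frame, num_spatial per frame.
    return [t * num_spatial + s for t in range(num_frames) for s in kept_positions]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_masked_inference(model: Callable[[List[float]], List[float]],
                              tokens: Sequence[float], num_spatial: int,
                              num_frames: int, mask_ratio: float,
                              num_masks: int, seed: int = 0) -> List[float]:
    """Average class probabilities over num_masks independently sampled tube masks."""
    rng = random.Random(seed)
    summed: List[float] = []
    for _ in range(num_masks):
        visible = tube_mask(num_spatial, num_frames, mask_ratio, rng)
        probs = softmax(model([tokens[i] for i in visible]))
        summed = probs if not summed else [a + p for a, p in zip(summed, probs)]
    return [a / num_masks for a in summed]


def toy_model(visible_tokens: List[float]) -> List[float]:
    """Hypothetical two-class model: logits derived from the mean visible token."""
    mean = sum(visible_tokens) / len(visible_tokens)
    return [mean, -mean]


# 8 frames with 16 spatial tokens each; scalar tokens keep the example small.
tokens = [float(i % 16) / 16.0 for i in range(8 * 16)]
probs = ensemble_masked_inference(toy_model, tokens, num_spatial=16,
                                  num_frames=8, mask_ratio=0.75, num_masks=3)
```

In this sketch each pass processes a quarter of the tokens, so the per-pass cost falls with the masking ratio while the number of passes grows with the ensemble number, which is the trade-off Figure 3 explores.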