X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma

TL;DR

X-MIC introduces a lightweight cross-modal adaptation for vision-language models that injects egocentric video information directly into the frozen VL embedding space. The method generates a video-specific vector $a_v$ via a second visual encoder and ego-spatio-temporal attention, then adds $a_v$ to each class text embedding. This yields strong cross-dataset and zero-shot generalization on nouns and verbs for Ego4D, Epic-Kitchens, and EGTEA. The approach decouples temporal modeling from the frozen visual backbone and yields state-of-the-art performance while keeping training and inference efficient. The work demonstrates practical potential for real-world AR/robotics applications and provides code for reproducibility.

Abstract

Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos has been largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves generalization of the pre-trained VLMs by disentangling learnable temporal modeling and frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for fine-grained cross-dataset action generalization, demonstrating the effectiveness of our method. Code is available at https://github.com/annusha/xmic

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 11 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Egocentric video classification with VL models. Top: Standard zero-shot CLIP. As the dominant object in the scene is a painting, the model predicts the class "painting", while the object of interest is "brush". Bottom: CLIP model with our X-MIC adaptation applied directly in the shared VL embedding space. X-MIC vectors shift the focus of the CLIP model to the hand area, guiding the text modality to capture egocentric domain-specific information.
  • Figure 2: Overview of our X-MIC method and previous adaptation methods for VLMs. Baselines: No Fusion is the standard zero-shot video classification method, in which the average of the frame representations is compared to the text representations in the shared VL embedding space. Early Fusion & Uni-Modal is a prompt learning method in which learnable parameters are concatenated to the text tokens and optimized through the text encoder. The text encoder is thereby adapted to the new domain. Early Fusion & Cross-Modal extends the Early Fusion & Uni-Modal method with additional learnable parameters in the form of an adapter. This adapter maps video representations to the embedding space of the text tokens, and the result is concatenated to the learnable prompts and text tokens. The memory required for the forward-backward pass through the text encoder grows with the number of combinations of text labels and videos in the batch. Late Fusion & Uni-Modal adapts both encoders by blending the original text and video representations with their corresponding adapted representations. Ours: The X-MIC adaptation method falls into the Late Fusion & Cross-Modal category. Adapted video features are blended with the original text features. This simple adaptation of the text modality to each individual video is efficient because it does not require gradient propagation through the text or video encoders. Additionally, we propose to employ Visual Encoder II, which offers flexibility in using various types of visual features for conditioning. Note that Visual Encoder I and II can be the same visual encoder, such as the CLIP visual encoder.
  • Figure 3: Ego-Spatio-Temporal Attention Module. It takes a sequence of full frames interleaved with hand crops as input and outputs the X-MIC vector $a_v$, which represents video $v$ as a single vector for text conditioning in the shared VL embedding space.
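
The captions above describe two scoring schemes: the No Fusion baseline, which compares the average frame representation to the class text embeddings, and the X-MIC late-fusion step, which adds a video-specific vector $a_v$ to every class text embedding before the comparison. The following is a minimal NumPy sketch of that difference, not the authors' implementation. The function names, the L2 normalization, the use of cosine similarity, and the random stand-ins for the encoder outputs and for $a_v$ are all assumptions made for illustration; in the paper, $a_v$ is produced by the learned adapter.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_logits(frame_feats, text_embs):
    """No Fusion baseline: mean-pool frame features, then compare to class text embeddings."""
    video = l2_normalize(frame_feats.mean(axis=0))   # (D,)
    return l2_normalize(text_embs) @ video           # (C,)

def conditioned_logits(frame_feats, text_embs, a_v):
    """Late-fusion, cross-modal sketch: add the video vector a_v to every class text embedding."""
    video = l2_normalize(frame_feats.mean(axis=0))   # (D,)
    adapted_text = l2_normalize(text_embs + a_v)     # (C, D); a_v is broadcast over classes
    return adapted_text @ video                      # (C,)

# Hypothetical sizes and random features standing in for frozen encoder outputs.
rng = np.random.default_rng(0)
T, C, D = 8, 5, 16                       # frames, classes, embedding dimension
frame_feats = rng.normal(size=(T, D))    # per-frame visual features
text_embs = rng.normal(size=(C, D))      # one frozen text embedding per class
a_v = rng.normal(size=(D,))              # placeholder for the adapter output

base = zero_shot_logits(frame_feats, text_embs)
cond = conditioned_logits(frame_feats, text_embs, a_v)
print("baseline prediction:", int(base.argmax()))
print("conditioned prediction:", int(cond.argmax()))
```

In this sketch the conditioning is a single broadcast addition in the shared embedding space, so no text or video encoder is run again per video. Setting `a_v` to zeros recovers the baseline scores.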