Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition

Cagri Gungor, Adriana Kovashka

TL;DR

This work proposes a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. It achieves state-of-the-art performance on the ARGO1M dataset and generalizes effectively across unseen scenarios and locations.

Abstract

First-person activity recognition is rapidly growing due to the widespread use of wearable cameras but faces challenges from domain shifts across different environments, such as varying objects or background scenes. We propose a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. Key contributions include analyzing the resilience of audio and motion features to domain shifts, using audio narrations for enhanced audio-text alignment, and applying consistency ratings between audio and visual narrations to optimize the impact of audio in recognition during training. Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.

Paper Structure (9 sections, 2 equations, 2 figures, 4 tables)

This paper contains 9 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustration of motion and audio resilience to domain shifts compared to appearance. While the motion of 'cutting' (first row) and audio of 'pouring' (second row) remain similar across different scenario-location domains, the appearance varies significantly with different objects and backgrounds.
  • Figure 2: The proposed framework extracts appearance $ap_i$, motion $m_i$, and audio $a_i$ embeddings using trained encoders $f$. Visual-text and audio-text alignments are performed independently to enhance the robustness of action representations. The consistency rating $r_i$, calculated offline using an LLM (Touvron et al., 2023), is then multiplied by the audio embedding to optimize the influence of audio in the multimodal prediction. Note that narrations and consistency ratings are used only during training to improve representation learning. During inference, embeddings are fused directly before prediction.
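
The weighting step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_embeddings`, the use of concatenation as the fusion operation, and the assumption that the rating lies in [0, 1] are all hypothetical choices made for illustration. The caption states only that $r_i$ is multiplied by the audio embedding during training and that embeddings are fused directly at inference.

```python
from typing import List, Optional

Vector = List[float]


def fuse_embeddings(
    ap: Vector, m: Vector, a: Vector, r: Optional[float] = None
) -> Vector:
    """Fuse appearance (ap), motion (m), and audio (a) embeddings for one sample.

    r is the consistency rating for this sample. It is available only during
    training; pass None at inference so the audio embedding is left unscaled.
    Assumptions (not from the paper): r is a scalar in [0, 1], and fusion is
    plain concatenation.
    """
    if r is not None:
        # Training: scale the audio embedding by the consistency rating, so
        # audio that disagrees with the visual narration contributes less.
        a = [r * x for x in a]
    # Hypothetical fusion: concatenate the three embeddings.
    return ap + m + a


# Toy two-dimensional embeddings for a single sample.
ap_i = [0.2, 0.4]
m_i = [0.1, 0.3]
a_i = [0.5, 1.0]

train_features = fuse_embeddings(ap_i, m_i, a_i, r=0.5)  # audio scaled by 0.5
infer_features = fuse_embeddings(ap_i, m_i, a_i)         # audio left unscaled
```

With a rating of 0.5, the audio part of the fused vector is halved during training, while the inference path leaves it unchanged. A real model would operate on batched tensors and feed the fused vector to a classifier.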