Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset

Elisa Ancarani, Julie Tores, Lucile Sassatelli, Rémy Sun, Hui-Yin Wu, Frédéric Precioso

TL;DR

This work investigates how modality-specific explanatory concepts improve multimodal video interpretation. By introducing Concept Modality Specific Datasets (CMSDs) derived from MObyGaze, the authors show that CMSD-based supervision boosts both early and late fusion models, with late fusion notably approaching early fusion performance when trained with CMSD. The study demonstrates reduced modality-attribution errors and enhanced interpretability, underscoring the value of modality-detailed annotations for robust, self-explainable video analysis. The results advocate for integrating human-provided explanations into multimodal training to advance interpretable learning in complex video tasks.

Abstract

We examine the impact of concept-informed supervision on multimodal video interpretation models using MObyGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of annotated concepts. Models trained on CMSDs outperform those using traditional legacy training in both early and late fusion approaches. Notably, this approach enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models and contribute to advancing interpretable multimodal learning in complex video analysis.
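The idea of a CMSD as a modality-filtered subset can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sample` record, the `build_cmsd` helper, and the modality names are hypothetical, and the sketch assumes that a positive sample is kept in a modality's subset only if at least one of its annotated concepts belongs to that modality, while negative samples are always kept.

```python
from dataclasses import dataclass

# Hypothetical modality names, following the abstract (visual, textual, audio).
MODALITIES = ("visual", "text", "audio")


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one annotated clip."""
    clip_id: str
    label: int  # 1 = positive, 0 = negative
    concept_modalities: frozenset = frozenset()  # modalities of annotated concepts


def build_cmsd(samples, modality):
    """Return the subset of samples used for one modality-specific dataset.

    Assumption: positives without a concept in `modality` are dropped,
    and negatives are always kept.
    """
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    return [
        s for s in samples
        if s.label == 0 or modality in s.concept_modalities
    ]


# Toy data: the clips and their annotations are invented for illustration.
samples = [
    Sample("a", 1, frozenset({"visual"})),
    Sample("b", 1, frozenset({"text"})),
    Sample("c", 1, frozenset({"visual", "audio"})),
    Sample("d", 0),
]

v_cmsd = build_cmsd(samples, "visual")
t_cmsd = build_cmsd(samples, "text")
a_cmsd = build_cmsd(samples, "audio")
```

Under these assumptions, a text-only model trained on `t_cmsd` never sees clip "a" as a positive, since its annotated concepts are purely visual; this matches the intuition behind the modality-attribution errors that the abstract describes.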

Paper Structure

This paper contains 5 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Two positive samples for character objectification from the MObyGaze dataset. The concepts annotated indicate that objectification is due to the visual modality only in (a), and to the textual modality only in (b). For sample (a) (resp. (b)), the model fed with text only (resp. vision only) and trained with the Modality Agnostic Dataset wrongly detects objectification, while it correctly classifies the sample as negative when trained on the text-specific dataset T-CMSD (resp. vision-specific dataset V-CMSD).
  • Figure 2: Distribution of objectifying labels by frequency and duration in the MObyGaze dataset.
  • Figure 3: Normalized distribution of visual, text, and audio concepts across movies.
  • Figure 4: The Concept Modality Agnostic Dataset (CMAD) contains all positive samples (Hard Negative and Sure) across modalities, while the Concept Modality Specific Dataset (CMSD) filters these samples to include only those with modality-specific concepts.
  • Figure 5: Comparison of Early and Late Fusion Models considering all three modalities. Tokens are: Audio (orange), Video (blue), and Text (green).
  • ...and 3 more figures