Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset
Elisa Ancarani, Julie Tores, Lucile Sassatelli, Rémy Sun, Hui-Yin Wu, Frédéric Precioso
TL;DR
This work investigates how modality-specific explanatory concepts improve multimodal video interpretation. By introducing Concept Modality Specific Datasets (CMSDs) derived from MOByGaze, the authors show that CMSD-based supervision boosts both early and late fusion models, with late fusion notably approaching early fusion performance when trained with CMSD. The study demonstrates reduced modality-attribution errors and enhanced interpretability, underscoring the value of modality-detailed annotations for robust, self-explainable video analysis. The results advocate for integrating human-provided explanations into multimodal training to advance interpretable learning in complex video tasks.
Abstract
We examine the impact of concept-informed supervision on multimodal video interpretation models using MOByGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of annotated concepts. Models trained on CMSDs outperform those trained with legacy training, for both early and late fusion approaches. Notably, CMSD training enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models and contribute to advancing interpretable multimodal learning in complex video analysis.
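To make the idea of modality-categorized subsets concrete, the following is a minimal sketch of how annotated samples could be grouped by concept modality. It is an illustration only, not the authors' pipeline: the record layout (`id`, `concepts`, `modality`), the function name `build_cmsds`, the toy clips, and the choice to place a sample in every subset for which it has at least one concept are all assumptions.

```python
from collections import defaultdict

# Modalities named in the abstract.
MODALITIES = ("visual", "textual", "audio")


def build_cmsds(samples):
    """Group sample ids by the modality of their annotated concepts.

    Assumption: each sample is a dict with an "id" and a list of
    "concepts", each concept carrying a "modality" field. This layout
    is hypothetical and is not the MOByGaze schema.
    """
    subsets = defaultdict(list)
    for sample in samples:
        # Collect the modalities that appear among this sample's concepts.
        present = {concept["modality"] for concept in sample["concepts"]}
        # Assumption: a sample joins every subset it has a concept for.
        for modality in MODALITIES:
            if modality in present:
                subsets[modality].append(sample["id"])
    # Return a plain dict with one (possibly empty) list per modality.
    return {modality: subsets[modality] for modality in MODALITIES}


# Toy, made-up records used only to exercise the function.
toy_samples = [
    {"id": "clip_a", "concepts": [{"name": "c1", "modality": "visual"}]},
    {"id": "clip_b", "concepts": [{"name": "c2", "modality": "textual"},
                                  {"name": "c3", "modality": "visual"}]},
    {"id": "clip_c", "concepts": [{"name": "c4", "modality": "audio"}]},
]

cmsds = build_cmsds(toy_samples)
```

Under these assumptions, `cmsds["visual"]` holds `clip_a` and `clip_b`, while the textual and audio subsets hold `clip_b` and `clip_c` respectively. Whether the actual CMSDs allow a sample to appear in more than one subset is not stated in the abstract.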
