Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models

Khazar Khorrami, Okko Räsänen

TL;DR

Addresses unsupervised audiovisual alignment between speech and images in visually grounded speech (VGS) models. Proposes two metrics, Alignment Score ($AS$) and Glancing Score ($GS$), to formalize and evaluate temporospatial alignments, and introduces a VGS variant with cross-modal attention that improves both localization and retrieval on SPEECH-COCO/MSCOCO. Evaluations on word-object pairs whose ground truth is derived using Word2vec show that the attention-enhanced models outperform prior DAVEnet-like baselines in both alignment metrics and semantic retrieval. The work provides two complementary metrics for cross-modal alignment and demonstrates the practical viability of attention-guided VGS for unsupervised object-word localization, with potential applicability to broader cross-modal alignment problems.

Abstract

Systems that can find correspondences between multiple modalities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an unsupervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken words and the corresponding visual objects without ever being explicitly trained for object localization or word recognition. As the main contributions, we formalize the alignment problem in terms of an audiovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task that uses a cross-modal attention layer. We test our model and a previously proposed model in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We compare the alignment performance measured with our proposed evaluation metrics to performance in the semantic retrieval task commonly used to evaluate VGS models. We show that the cross-modal attention layer not only helps the model achieve higher semantic cross-modal retrieval performance, but also leads to substantial improvements in the alignment performance between image objects and spoken words.

Paper Structure

This paper contains 9 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: An illustration of a temporospatial alignment tensor $T[x, y, t]$ between an image and an utterance related to the image. Red "clouds" illustrate manifolds where $T$ has positive values. The ground-truth alignment region for [dog] is visualized with a violet box. Conceptually adapted from harwath2018jointly.
  • Figure 2: CNN${_\textup{ATT}}$v0 architecture using a cross-modal attention block on top of the DAVEnet model. 64 = number of time frames, 196 = number of flattened spatial coordinates (x, y), $\curvearrowright$ = transpose. The layers used in alignment evaluation are highlighted in green.
  • Figure 3: Relationship of alignment score (top) and glancing score (bottom) with average object size and average word duration (for CNN${_\textup{ATT}}$v1 softmax). Solid lines = linear fits to the data. Shaded lines = corresponding fits to random baselines.
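
To make the tensor in Figure 1 and the dimensions in Figure 2 concrete, the following is a minimal sketch of how a temporospatial alignment tensor $T[x, y, t]$ could be computed as a dot product between per-location image embeddings and per-frame speech embeddings, in the spirit of earlier DAVEnet-style work. It is not the authors' implementation: the function name `alignment_tensor`, the embedding size of 8, the random inputs, and the assumption that the 196 spatial coordinates form a 14 × 14 grid are all illustrative choices.

```python
import numpy as np


def alignment_tensor(image_feats: np.ndarray, audio_feats: np.ndarray) -> np.ndarray:
    """Dot-product similarity between every image location and every time frame.

    image_feats: (X, Y, D) array of per-location image embeddings.
    audio_feats: (T, D) array of per-frame speech embeddings.
    Returns an (X, Y, T) array, one similarity value per (x, y, t) triple.
    """
    return np.einsum("xyd,td->xyt", image_feats, audio_feats)


# Hypothetical inputs: a 14 x 14 spatial grid (an assumption; 14 * 14 = 196),
# 64 time frames as in Figure 2, and an arbitrary embedding size of 8.
rng = np.random.default_rng(0)
image_feats = rng.standard_normal((14, 14, 8))
audio_feats = rng.standard_normal((64, 8))

T = alignment_tensor(image_feats, audio_feats)

# Flattening the two spatial axes gives the 196 x 64 layout mentioned in Figure 2.
T_flat = T.reshape(196, 64)
```

Positive entries of `T` correspond to the red "clouds" in Figure 1: locations and time frames whose embeddings point in similar directions.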