Coarse-to-fine Alignment Makes Better Speech-image Retrieval

Lifeng Zhou, Yuke Li

TL;DR

Problem: improving speech-image retrieval when labeled data is scarce. Approach: jointly train speech-image contrastive learning (SIC) for coarse alignment and speech-image matching (SIM) for fine alignment, aided by an embedding queue and momentum distillation. Key contributions: an end-to-end framework that uses HuBERT for speech, BLIP-2 for images, a multimodal transformer, dual objectives with hard-negative mining, and momentum-based self-distillation; it improves R@1 by more than 4% over the previous state-of-the-art method on Flickr8k and SpokenCOCO and shows strong zero-shot generalization. Significance: demonstrates robust cross-modal alignment with noisy supervision and large negative pools, enabling more scalable multimodal speech understanding.

Abstract

In this paper, we propose a novel framework for speech-image retrieval. We utilize speech-image contrastive (SIC) learning tasks to align speech and image representations at a coarse level and speech-image matching (SIM) learning tasks to further refine the fine-grained cross-modal alignment. SIC and SIM learning tasks are jointly trained in a unified manner. To optimize the learning process, we utilize an embedding queue that facilitates efficient sampling of high-quality and diverse negative representations during SIC learning. Additionally, it enhances the learning of SIM tasks by effectively mining hard negatives based on contrastive similarities calculated in SIC tasks. To further optimize learning under noisy supervision, we incorporate momentum distillation into the training process. Experimental results show that our framework outperforms the state-of-the-art method by more than 4% in R@1 on two benchmark datasets for the speech-image retrieval tasks. Moreover, as observed in zero-shot experiments, our framework demonstrates excellent generalization capabilities.
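
The coarse-to-fine interplay described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it shows only the speech-to-image direction of a contrastive (InfoNCE-style) loss in which queued image embeddings serve as extra negatives, and it then reuses the in-batch similarities to sample a hard negative for the matching task. The function names, the temperature of 0.07, and the choice to sample negatives in proportion to softmax similarity are all assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sic_loss(speech_batch, image_batch, image_queue, temperature=0.07):
    """Speech-to-image contrastive loss (illustrative, one direction only).

    Candidates are the in-batch images followed by the queued images, so
    every queue entry acts as an additional negative. Returns the mean loss
    and the in-batch similarity rows for later hard-negative mining.
    """
    candidates = [normalize(v) for v in image_batch + image_queue]
    total = 0.0
    in_batch_sims = []
    for i, speech in enumerate(speech_batch):
        s = normalize(speech)
        row = [dot(s, c) / temperature for c in candidates]
        probs = softmax(row)
        total += -math.log(probs[i])  # the paired image sits at index i
        in_batch_sims.append(row[:len(image_batch)])
    return total / len(speech_batch), in_batch_sims


def mine_hard_negative(sim_row, positive_index, rng):
    """Sample one in-batch negative, favouring high contrastive similarity.

    Assumes the batch holds at least two items; the positive pair is given
    zero weight so it can never be drawn.
    """
    probs = softmax(sim_row)
    weights = [0.0 if j == positive_index else p for j, p in enumerate(probs)]
    return rng.choices(range(len(sim_row)), weights=weights, k=1)[0]


# Toy usage: two speech clips, their paired images, and one queued negative.
speech = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 1.0]]
queue = [[-1.0, 0.0]]
loss, sims = sic_loss(speech, images, queue)
negative_for_first_clip = mine_hard_negative(sims[0], 0, random.Random(0))
```

In this sketch the matching task would receive the positive pair plus the mined negative, so the fine-grained objective concentrates on the pairs that the coarse objective finds hardest to separate.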

Paper Structure (11 sections, 6 equations, 1 figure, 3 tables)

This paper contains 11 sections, 6 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The HuBERT and speech encoder are utilized to extract speech embeddings. The BLIP-2 image encoder is responsible for extracting image embeddings. The speech and image embeddings are fed into the multimodal encoder for interaction. We propose SIC and SIM tasks to jointly align speech and image embeddings. We employ a queue that allows for the sampling of diverse negative representations for the SIC tasks and hard negative examples for the SIM tasks. In order to improve learning with noisy data, we generate pseudo-targets using the momentum model as additional supervision during training.
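
The momentum-model supervision mentioned in the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the momentum model is taken to be an exponential moving average of the online model's parameters, and the pseudo-targets are taken to be a blend of the one-hot label with the momentum model's softmax output. The names `ema_update`, `distilled_targets`, and `soft_cross_entropy`, along with the `decay` and `alpha` values, are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(momentum_params, online_params, decay=0.995):
    """Move each momentum parameter a small step toward the online one."""
    return [decay * m + (1.0 - decay) * p
            for m, p in zip(momentum_params, online_params)]


def distilled_targets(momentum_sims, positive_index, alpha=0.4):
    """Blend the one-hot label with the momentum model's soft prediction.

    alpha is the (assumed) weight given to the pseudo-targets; the result
    is still a probability distribution over the candidates.
    """
    soft = softmax(momentum_sims)
    return [alpha * q + (1.0 - alpha) * (1.0 if j == positive_index else 0.0)
            for j, q in enumerate(soft)]


def soft_cross_entropy(online_sims, targets):
    """Cross-entropy between the online prediction and soft targets."""
    probs = softmax(online_sims)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


# Toy usage: the momentum model considers candidate 1 nearly as plausible
# as the labelled positive (candidate 0), so the target is softened.
targets = distilled_targets([2.0, 1.8, -1.0], positive_index=0)
loss = soft_cross_entropy([2.5, 0.5, -1.0], targets)
momentum = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
```

The softened targets give partial credit to candidates that the momentum model also rates highly, which is one way to reduce the penalty for plausible matches that the noisy labels mark as negatives.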