SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath

TL;DR

This work aims to improve self-supervised, visually grounded speech representations with two additions to SpeechCLIP: dynamic keyword segmentation based on Continuous Integrate-and-Fire (CIF), and a hybrid multi-task framework that merges the cascaded and parallel SpeechCLIP architectures. CIF provides monotonic, variable-length subword segmentation and is trained with a quantity objective $L_{QUA}$, while the Cascaded SpeechCLIP+ and Hybrid SpeechCLIP+ designs combine subword-level and utterance-level cues for speech-to-image alignment. Results on Flickr8k and SpokenCOCO show that the CIF-based cascaded models produce fewer duplicate keywords and extract words/BPE units more accurately, and that joint multi-task training improves image-speech retrieval in some settings, with the gains varying by dataset. Overall, the paper shows that combining monotonic segmentation with multi-task learning can improve subword-level representations and cross-modal alignment in visually grounded speech systems.

Abstract

The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework. Our experimental evaluation is performed on the Flickr8k and SpokenCOCO datasets. The results show that in the speech keyword extraction task, the CIF-based cascaded SpeechCLIP model outperforms the previous cascaded SpeechCLIP model using a fixed number of CLS tokens. Furthermore, through our hybrid architecture, cascaded task learning boosts the performance of the parallel branch in image-speech retrieval tasks.
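
To make the first extension concrete, the following is a minimal sketch of how a CIF-style module can turn a frame sequence into a variable number of segment embeddings, instead of a fixed number of CLS tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names `cif` and `quantity_loss`, the firing threshold of 1.0, the handling of leftover weight at the end of the utterance, and the absolute-difference form of the quantity objective are all assumptions based on the general CIF formulation.

```python
from typing import List

Vector = List[float]


def cif(frames: List[Vector], alphas: List[float], threshold: float = 1.0) -> List[Vector]:
    """Integrate frame features until the accumulated weight reaches `threshold`,
    then "fire" one segment embedding. Assumes 0 <= alpha <= threshold, so a
    single frame can trigger at most one firing."""
    dim = len(frames[0]) if frames else 0
    outputs: List[Vector] = []
    acc_weight = 0.0             # weight accumulated for the current segment
    acc_state = [0.0] * dim      # weighted sum of frames for the current segment
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # No boundary yet: keep integrating.
            acc_weight += alpha
            acc_state = [s + alpha * x for s, x in zip(acc_state, frame)]
        else:
            # Boundary: split this frame's weight between the two segments.
            used = threshold - acc_weight
            outputs.append([s + used * x for s, x in zip(acc_state, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_state = [rest * x for x in frame]
    # Simplification (assumption): leftover weight below the threshold is dropped.
    return outputs


def quantity_loss(alphas: List[float], target_count: float) -> float:
    """Assumed form of a quantity objective: the gap between the total
    predicted weight and a target number of segments."""
    return abs(sum(alphas) - target_count)


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    alphas = [0.5, 0.5, 0.5, 0.5]
    print(cif(frames, alphas))
    print(quantity_loss(alphas, target_count=2))
```

Because the accumulation runs left to right, the resulting segments are monotonic in time, and their number depends on the predicted weights rather than on a preset token count.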

Paper Structure (11 sections, 5 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Illustration of the proposed models. BN and VQ denote batch normalization and vector quantization, respectively. (a) In hybrid SpeechCLIP, the training loss combines two contrastive losses. The first is between the leftmost CLS token and the image representation output by the CLIP image encoder (the parallel branch, identical to parallel SpeechCLIP (Shih et al., 2022)). The second is between the speech representation that the CLIP text encoder outputs for the remaining $K$ CLS tokens and the image representation output by the CLIP image encoder (the cascaded branch, identical to cascaded SpeechCLIP (Shih et al., 2022)). (b) In cascaded SpeechCLIP+, instead of extracting keyword information through a fixed number of learnable CLS tokens, CIF is used to segment frame-level features into subword-level keyword sequences. In hybrid SpeechCLIP+, the parallel branch is based on parallel SpeechCLIP, and the cascaded branch is based on cascaded SpeechCLIP+.
  • Figure 2: An example of keywords extracted by Cascaded SpeechCLIP+ from the Flickr8k test set, showing the image, spoken caption, and extracted keywords with corresponding segments.
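
The hybrid training loss described in the Figure 1 caption, a sum of a parallel-branch and a cascaded-branch contrastive loss against the same image embeddings, can be sketched as follows. This is a hedged illustration only: the CLIP-style symmetric InfoNCE form, the temperature of 0.07, the branch weights `w_parallel` and `w_cascaded`, and all function names are assumptions and are not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(speech: List[Vector], image: List[Vector],
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired speech/image embeddings
    (assumed form; pair i in `speech` matches pair i in `image`)."""
    s = [_normalize(v) for v in speech]
    im = [_normalize(v) for v in image]
    logits = [[sum(a * b for a, b in zip(u, v)) / temperature for v in im] for u in s]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


def hybrid_loss(parallel_speech: List[Vector], cascaded_speech: List[Vector],
                image: List[Vector],
                w_parallel: float = 1.0, w_cascaded: float = 1.0) -> float:
    """Weighted sum of the two branch losses; the weights are hypothetical."""
    return (w_parallel * contrastive_loss(parallel_speech, image)
            + w_cascaded * contrastive_loss(cascaded_speech, image))


if __name__ == "__main__":
    image = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(hybrid_loss(aligned, aligned, image))   # small: both branches match
    print(hybrid_loss(aligned, swapped, image))   # larger: cascaded branch mismatched
```

In this sketch both branches are scored against the same image embeddings, which is what lets the cascaded task act as an auxiliary signal for the parallel branch during joint training.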