Translating speech with just images

Dan Oneata, Herman Kamper

TL;DR

This work addresses translating a low-resource language (Yorùbá) into English without parallel speech-translation data by using images as intermediate supervision. It generates English captions for training images with a pretrained captioner and then trains a transformer-based audio-to-text model that maps Yorùbá speech to English text, keeping most parameters fixed and learning only a small cross-attention adapter plus a projection. Evaluations on FACC and YFACC show modest but meaningful BLEU scores, with caption diversity and decoding strategies significantly impacting performance; generated captions can yield higher BLEU than human references in certain settings, suggesting captions are not the bottleneck. The results demonstrate a viable path for visually grounded translation in low-resource scenarios, while highlighting limitations such as shorter, sometimes hallucinated translations and the need for confidence-estimation techniques for reliable deployment.

Abstract

Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
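
The data flow described in the abstract can be illustrated with a minimal sketch: each spoken utterance is paired with an image, a captioner produces several English captions for that image, and every caption becomes a text target for the same audio. The function names, file names, and the toy captioner below are hypothetical placeholders, not the authors' code; a real system would call a pretrained image captioner with a diversity-promoting decoding scheme in place of `toy_captioner`.

```python
from typing import Callable, Iterable, List, Tuple


def build_training_pairs(
    audio_image_pairs: Iterable[Tuple[str, str]],
    caption_image: Callable[[str, int], List[str]],
    captions_per_image: int = 3,
) -> List[Tuple[str, str]]:
    """Turn (audio, image) pairs into (audio, English caption) training pairs."""
    pairs: List[Tuple[str, str]] = []
    for audio_path, image_path in audio_image_pairs:
        # Each generated caption becomes a separate target for the same audio,
        # so more diverse captions give more varied supervision per utterance.
        for caption in caption_image(image_path, captions_per_image):
            pairs.append((audio_path, caption))
    return pairs


def toy_captioner(image_path: str, n: int) -> List[str]:
    # Hypothetical stand-in for a pretrained captioner: returns fixed strings.
    templates = [
        "a dog runs on grass",
        "a brown dog is running outside",
        "a dog playing in a field",
    ]
    return templates[:n]


pairs = build_training_pairs(
    [("yoruba_0001.wav", "img_0001.jpg")],  # hypothetical file names
    toy_captioner,
    captions_per_image=2,
)
```

Note that no Yorùbá transcription or human English translation appears anywhere in this loop; the image is the only link between the audio and the English text.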

Paper Structure (10 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of our speech translation system. Given audio in a foreign language (e.g., Yorùbá), we generate natural language translations in a high-resource language (e.g., English). We achieve this with only audio-image pairs by generating captions automatically using a pretrained image captioner and then using these as targets for an audio-to-text model.
  • Figure 2: Our audio-to-text model is a transformer that generates text autoregressively conditioned on audio. The network consists of learnable cross-attention layers interspersed in a frozen GPT-2 decoder to integrate wav2vec audio features.
  • Figure 3: Sample captions for the image on top using three types of decoding on the GIT image captioning model.
  • Figure 4: Examples of Yorùbá-to-English translations (top) and English-to-English paraphrases (bottom) for the visually grounded speech models trained on captions generated by GIT with diverse beam search.
  • Figure 5: Performance in terms of the BLEU score of the generated captions, speech translation and speech paraphrasing, for all nine combinations of image models and decoding strategies.
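
Since Figure 5 reports performance as BLEU, a small worked sketch of the metric may help interpret it. The code below is a simplified, unsmoothed sentence-level BLEU (clipped n-gram precision combined with a brevity penalty), assuming whitespace tokenization. It is an illustration of the standard formula, not the evaluation script used in the paper, which typically would rely on a corpus-level implementation.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, references: List[str], max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU with whitespace tokenization."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref: Counter = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing: a zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(cand)
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# A short output that matches the start of a longer reference:
short_score = sentence_bleu("a dog runs", ["a dog runs on the grass"], max_n=2)
```

In this example every unigram and bigram of the candidate appears in the reference, so both precisions are 1 and the score is determined entirely by the brevity penalty, exp(1 - 6/3). This is one reason why translations that are correct but shorter than the references can still receive modest BLEU scores.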