Visually grounded few-shot word learning in low-resource settings

Leanne Nortje, Dan Oneata, Herman Kamper

TL;DR

The paper tackles visually grounded few-shot word learning in low-resource settings by combining a mining-based data augmentation pipeline with a word-to-image attention model (MattNet). On an existing English benchmark of natural images, it achieves better few-shot retrieval and classification performance with fewer shots than previous approaches. It then extends the approach to Yorùbá, a real low-resource language, showing that cross-language transfer from English multimodal data improves results. Key contributions include the QbERT-based cross-modal pair mining, the MattNet architecture with a dedicated word-to-image attention mechanism, and analyses of mistakes, contextual bias, and scalability to more keywords. The work provides a foundation for extending visually grounded speech systems beyond English to under-resourced languages.

Abstract

We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. Moreover, all previous studies were performed using English speech-image data. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied for multimodal few-shot learning in a real low-resource language, Yorùbá. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark. Many of the model's mistakes are due to confusion between visual concepts co-occurring in similar contexts. The experiments on Yorùbá show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.

Paper Structure (28 sections, 1 equation, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Given the few examples in the support set $\mathcal{S}$, the multimodal few-shot classification task is, for example, to identify the image depicting the word "zebra" from a set of unseen images.
  • Figure 2: MattNet consists of (c) a vision and an audio network. The audio network consists of (a + b) an acoustic context network and a BiLSTM network. The audio and vision networks are connected with a word-to-image attention mechanism.
  • Figure 3: The SpokenCOCO data splits used to train and evaluate the MattNet model. The background data (blue background) consists of spoken audio utterances and images belonging to concepts not present in the support set. The mining splits consist of single-modality audio and image samples from which the training data is artificially extended, and include both background samples and samples belonging to the few-shot classes (green).
  • Figure 4: Examples of retrieval and few-shot classification for two queries using the $K = 100$ MattNet model. Concepts that associate strongly with context, such as fire hydrant, which often appears in urban environments, are more challenging to retrieve than to classify.
  • Figure 5: The top five ranked samples for the audio query corresponding to each of the five concepts using the $K=100$ MattNet model. For each image, we show whether it is correct (i.e. whether it contains the query concept) and the attention explanation (red indicating the input regions relevant for the given audio query). The IOU values give the quantitative localisation performance of the attention explanations: the intersection over union of the binarised attentions with the ground truth annotations, averaged over all images that contain a given concept.
  • ...and 2 more figures
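
The IOU score described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code: the function names, the nested-list representation of attention maps and masks, and the fixed binarisation threshold of 0.5 are all assumptions made for illustration, since the caption does not say how the attentions are binarised.

```python
def binarise(attention, threshold=0.5):
    """Turn a 2-D attention map (rows of floats) into a boolean mask.

    The threshold of 0.5 is an assumption; the source does not state
    how the attention maps are binarised.
    """
    return [[value >= threshold for value in row] for row in attention]


def iou(pred_mask, true_mask):
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred, true in zip(pred_row, true_row):
            intersection += int(bool(pred) and bool(true))
            union += int(bool(pred) or bool(true))
    # Two empty masks have no union; report 0.0 rather than dividing by zero.
    return intersection / union if union else 0.0


def mean_iou(attentions, true_masks, threshold=0.5):
    """Average IOU over all images that contain a given concept."""
    scores = [
        iou(binarise(attention, threshold), true_mask)
        for attention, true_mask in zip(attentions, true_masks)
    ]
    return sum(scores) / len(scores)


# Two toy 2x2 images for one hypothetical concept.
attentions = [
    [[0.9, 0.2], [0.7, 0.1]],  # attention covers twice the annotated area
    [[0.1, 0.8], [0.2, 0.9]],  # attention matches the annotation exactly
]
true_masks = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 1]],
]
print(mean_iou(attentions, true_masks))
```

In the first toy image the binarised attention overlaps the annotation in one pixel out of a two-pixel union, and in the second the two masks coincide, so the per-image scores are 0.5 and 1.0.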