Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling

Leanne Nortje

TL;DR

This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images and introduces a task called visually prompted keyword localisation to detect and localise keywords in speech using images.

Abstract

This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and on understanding human language acquisition. We introduce a task called visually prompted keyword localisation, in which images are used to detect and localise keywords in speech. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages such as Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but multilingualism does not affect the bias in this model in the way it does in children.

Paper Structure (74 sections, 1 equation, 13 figures, 16 tables)

Figures (13)

  • Figure 1: Most VGS studies use the general model structure consisting of an audio and a vision network connected with a multimodal mechanism.
  • Figure 2: The VGS models in this dissertation can be organised into two high-level categories: low-resource applications and computational ME studies. As a result, Research Question 1 and Research Question 2 are placed under low-resource applications, and Research Question 3 and Research Question 4 are placed under computational ME studies.
  • Figure 3: This figure contains a detailed dissertation outline. The dissertation consists of two parts: low-resource applications and computational cognitive ME studies. This figure shows in which categories the research questions are placed. It also shows in which chapters and papers we attempt to answer each research question.
  • Figure 4: (c) LocalisationAttentionNet, presented in Research Paper 1, consists of a vision network and an audio (a+b) network. The two branches are connected through a multimodal attention mechanism consisting of a matchmap (Harwath et al., 2018). The model outputs a similarity score $S$ for a speech and an image input based on the context vectors obtained from the matchmap.
  • Figure 5: (c) Loc-AttNet consists of a vision network and an audio (a+b) network. The two branches are connected through a multimodal attention mechanism consisting of a matchmap $\mathcal{M}$ (Harwath et al., 2018), which is used to calculate a similarity score $S$ for a speech and an image input.
  • ...and 8 more figures
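
The captions for Figures 4 and 5 describe a matchmap that links the audio and vision branches and yields a similarity score $S$. The sketch below is a minimal illustration of that idea, not the dissertation's implementation. It assumes the matchmap holds the dot product between every audio frame embedding and every image region embedding, and it pools the matchmap with a simple max over regions followed by a mean over frames. That pooling is an assumption made for illustration; the models in the dissertation instead derive context vectors from the matchmap through attention. The function names and toy values are hypothetical.

```python
def matchmap(audio_frames, image_regions):
    """Dot product between every audio frame embedding and every image region embedding.

    Returns a list of rows, one row per audio frame and one column per image region.
    """
    return [
        [sum(a * v for a, v in zip(frame, region)) for region in image_regions]
        for frame in audio_frames
    ]


def similarity_score(m):
    """Pool a matchmap into a single score.

    Assumed pooling (illustrative only): take the best-matching image region
    for each audio frame, then average over frames.
    """
    return sum(max(row) for row in m) / len(m)


# Toy embeddings (hypothetical values): 3 audio frames, 2 image regions, dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[1.0, 0.0], [0.0, 2.0]]

M = matchmap(audio, image)   # shape: 3 frames x 2 regions
S = similarity_score(M)      # one scalar for the speech-image pair
```

Because each row of the matchmap corresponds to one audio frame, the rows with the largest values also indicate where in the utterance the image content is most strongly matched, which is the intuition behind using such a map for localisation.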