Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling

Leanne Nortje

TL;DR

This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images and introduces a task called visually prompted keyword localisation to detect and localise keywords in speech using images.

Abstract

This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on two areas: applications for low-resource languages and understanding human language acquisition. We introduce a task called visually prompted keyword localisation, in which images are used to detect and localise keywords in speech. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages such as Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we find that multilingualism does not affect the bias in this model in the way that has been observed in children.

Paper Structure (74 sections, 1 equation, 13 figures, 16 tables)


Introduction
Motivation
Low-resource Technology
Computational Cognitive Modelling
Visually Grounded Speech Modelling Overview
The General VGS Model Architecture
Taking Advantage of State-of-the-Art Unimodal Speech or Vision Models
Multilingual VGS Modelling
Using VGS Models For Low-Resource Languages
Research Questions
Research Questions on Low-Resource Applications
Research Questions on Computationally Studying the Mutual Exclusivity Bias
Approach
Visually Prompted Keyword Localisation (VPKL)
Visually Grounded Few-Shot Spoken Word Acquisition for Low-Resource Languages
...and 59 more sections

Figures (13)

Figure 1: Most VGS studies use the general model structure consisting of an audio and a vision network connected with a multimodal mechanism.
Figure 2: The VGS models in this dissertation can be organised into two high-level categories: low-resource applications and computational ME studies. As a result, Research Question 1 and Research Question 2 are placed under low-resource applications, and Research Question 3 and Research Question 4 are placed under computational ME studies.
Figure 3: A detailed dissertation outline. The dissertation consists of two parts: low-resource applications and computational cognitive ME studies. The figure shows the category under which each research question is placed, as well as the chapters and papers in which we attempt to answer it.
Figure 4: (c) LocalisationAttentionNet, presented in Research Paper 1, consists of a vision network and an audio network (a+b). The two branches are connected through a multimodal attention mechanism consisting of a matchmap (Harwath et al., 2018). The model outputs a similarity score $S$ for a speech and an image input based on the context vectors obtained from the matchmap.
Figure 5: (c) Loc-AttNet consists of a vision network and an audio network (a+b). The two branches are connected through a multimodal attention mechanism consisting of a matchmap $\mathcal{M}$ (Harwath et al., 2018), which is used to calculate a similarity score $S$ for a speech and an image input.
...and 8 more figures
