Adding Multimodal Capabilities to a Text-only Translation Model

Vipin Vijayan; Braeden Bowen; Scott Grigsby; Timothy Anderson; Jeremy Gwinnup

TL;DR

This work incrementally transforms a text-only machine translation (MT) model into a multimodal machine translation (MMT) model by pre-training with vision-based masking of the source text and then fine-tuning on Multi30k, achieving state-of-the-art performance on the Multi30k 2016 en-de test set.

Abstract

While most current work in multimodal machine translation (MMT) uses the Multi30k dataset for training and evaluation, we find that the resulting models overfit to Multi30k to an extreme degree. Consequently, these models perform very poorly when evaluated against typical text-only test sets such as the WMT newstest datasets. To perform well on both Multi30k and typical text-only datasets, we use a performant text-only machine translation (MT) model as the starting point of our MMT model. We add vision-text adapter layers, connected via gating mechanisms, to the MT model, and incrementally transform the MT model into an MMT model by 1) pre-training using vision-based masking of the source text and 2) fine-tuning on Multi30k.
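The role of the gating mechanism can be illustrated with a minimal sketch. It assumes a scalar tanh gate on a residual connection, initialized at zero; the abstract only states that the adapter layers are "connected via gating mechanisms", so the gate form, the function name gated_adapter_output, and all numeric values below are hypothetical and chosen for illustration only.

```python
import math


def gated_adapter_output(text_hidden, adapter_update, gate):
    """Blend a text-only hidden state with a vision-text adapter update.

    Assumption (not confirmed by the source): a scalar tanh gate applied
    to the adapter output before it is added to the residual stream.
    """
    scale = math.tanh(gate)
    return [h + scale * u for h, u in zip(text_hidden, adapter_update)]


# Hypothetical hidden state from the frozen text-only MT model.
text_hidden = [0.5, -1.0, 2.0]
# Hypothetical output of a vision-text adapter layer for the same position.
adapter_update = [0.2, 0.4, -0.6]

# With the gate at zero the adapter contributes nothing, so the output
# matches the text-only model exactly.
closed = gated_adapter_output(text_hidden, adapter_update, gate=0.0)

# Once training moves the gate away from zero, visual information is mixed in.
opened = gated_adapter_output(text_hidden, adapter_update, gate=1.0)
```

Under this assumption, a zero-initialized gate means training starts from the unmodified MT model, and the multimodal contribution grows only as the gate parameters are updated.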

Paper Structure (30 sections, 3 figures, 6 tables)

Introduction
Related Works
Adapting pre-trained models for MMT
Masking for visual grounding
Gating mechanism for MMT
Methods
GRAM model architecture
Vision encoder
Perceiver resampler
Vision-text layer
Model hyper-parameters
Training
Pre-training
Training against Multi30k
Results and Discussion
...and 15 more sections

Figures (3)

Figure 1: Multimodal translation architecture, where multimodal components are incorporated into the Transformer translation model introduced by Vaswani et al. (2017). The parameters of the components bordered in red are initialized randomly and updated during training, while the parameters of the pre-trained vision encoder and the pre-trained Transformer translation model, bordered in black, are frozen. The gating parameters in the vision-text layers are updated using back-propagation, allowing us to smoothly transition from a text-only translation model into a multimodal translation model.
Figure 2: Examples from the CoMMuTE test dataset of our model (the $M_{\text{CR},\text{M30k}}$ model from the main results table) resolving ambiguous input text when given contextual images.
Figure 3: Gating values during a) pre-training over the CR dataset, b) fine-tuning over the Multi30k dataset, and c) directly training on the Multi30k dataset. Layer 1 is the vision-text adapter layer that is closest to the input. Note that some of the gating values overlap in some of the plots.
