Probing the Need for Visual Context in Multimodal Machine Translation

Ozan Caglayan, Pranava Madhyastha, Lucia Specia, Loïc Barrault

TL;DR

The paper questions the assumption that visual context contributes little to multimodal MT by applying systematic input degradations to Multi30K. It shows that when source textual context is scarce, visual information can meaningfully improve translations, challenging prior conclusions of modest or no visual benefit. Across color deprivation, entity masking, and progressive masking, multimodal models outperform text-only models given the same degraded input, with French outperforming other languages in some setups. Decoding with incongruent images severely worsens these systems, which indicates that they genuinely rely on the visual input. The authors argue for adaptive multimodal fusion that leverages visual grounding when needed and point to future work on learning when to integrate modalities to improve translation robustness and performance.

Abstract

Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models of source-side textual context. Our results show that under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model.

Paper Structure

This paper contains 18 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Entity masking: all masked MMT models are significantly better than the masked NMT (dashed). Incongruent decoding severely worsens all systems. The vanilla NMT baseline is 75.9.
  • Figure 2: Baseline MMT (top) translates the misspelled "son" while the masked MMT (bottom) correctly produces "enfant" (child) by focusing on the image.
  • Figure 3: Multimodal gain in absolute METEOR for progressive masking: the dashed gray curve indicates the percentage of non-masked words in the training set.
  • Figure 4: Attention example from entity masking experiments: (a) Baseline MMT translates the misspelled "son" (song $\rightarrow$ chanson) while (b) the masked MMT achieves a correct translation ([v]$\rightarrow$ enfant) by exploiting the visual modality.
  • Figure 5: Attention example from entity masking experiments where terrier, grass and fence are dropped from the source sentence: (a) Baseline MMT is not able to shift attention from the salient dog to the grass and fence, (b) the attention produced by the masked MMT first shifts to the background area while translating "on lush green [v]" then focuses on the fence.
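
The captions above refer to source sentences in which words are replaced by a [v] placeholder. The following is a minimal sketch of how such degraded inputs could be produced, not the authors' implementation. The function names, the example sentence, and the entity positions are hypothetical. The rule that progressive masking keeps the first k words and masks the rest is an assumption that is not stated on this page.

```python
MASK = "[v]"  # placeholder token as it appears in the figure captions


def mask_entities(tokens, entity_positions):
    """Replace the tokens at the given positions with the mask token.

    `entity_positions` is a hypothetical input: in practice it would come
    from entity annotations, which this sketch does not model.
    """
    return [MASK if i in entity_positions else tok
            for i, tok in enumerate(tokens)]


def progressive_mask(tokens, k):
    """Keep the first k tokens and mask every remaining token.

    Assumption: masking proceeds from the end of the sentence.
    """
    return tokens[:k] + [MASK] * max(len(tokens) - k, 0)


# Made-up sentence loosely modelled on the Figure 5 caption.
tokens = "a terrier runs on lush green grass near a fence".split()

# Positions of "terrier", "grass" and "fence" in the token list.
masked = mask_entities(tokens, {1, 6, 9})
print(" ".join(masked))   # a [v] runs on lush green [v] near a [v]

# Keep only the first four words of source context.
print(" ".join(progressive_mask(tokens, 4)))
```

With inputs degraded this way, a text-only model has no means of recovering the masked words, so any improvement from a multimodal model on the same input can be attributed to the image.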