Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages

Baban Gain, Dibyanayan Bandyopadhyay, Samrat Mukherjee, Chandranath Adak, Asif Ekbal

TL;DR

The study questions the added value of visual context in high-resource English-to-Indic MT by benchmarking strong unimodal baselines against two multimodal architectures under non-noisy and noisy text. It finds that images are largely redundant with large-scale unimodal pre-training, but provide modest benefits in noisy conditions—especially when using cropped features in low-noise scenarios and full-image features in high-noise contexts. Through probing with random images, CLIP/ViT feature comparisons, and gate analyses, the paper argues that visual context mainly acts as a regularizer rather than supplying meaningful disambiguation. These findings highlight the need for datasets where the image is essential for correct translation and for mechanisms that truly leverage visual semantics in multimodal MT.

Abstract

Neural Machine Translation (NMT) has made remarkable progress using large-scale textual data, but the potential of incorporating multimodal inputs, especially visual information, remains underexplored in high-resource settings. While prior research has focused on using multimodal data in low-resource scenarios, this study examines how image features impact translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images might be redundant in this context. Additionally, the research introduces synthetic noise to assess whether images help the model handle textual noise. Multimodal models slightly outperform text-only models in noisy settings, even when random images are used. The study's experiments translate from English to Hindi, Bengali, and Malayalam, significantly outperforming state-of-the-art benchmarks. Interestingly, the effect of visual context varies with the level of source text noise: no visual context works best for non-noisy translations, cropped image features are optimal for low noise, and full image features perform better in high-noise scenarios. This sheds light on the role of visual context, especially in noisy settings, and opens up a new research direction for Noisy Neural Machine Translation in multimodal setups. The research emphasizes the importance of combining visual and textual information to improve translation across various environments. Our code is publicly available at https://github.com/babangain/indicMMT.

Paper Structure

This paper contains 32 sections, 4 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: Example from the combined Hindi, Bengali, and Malayalam dataset
  • Figure 2: Selective Attention Architecture for Multimodal MT. (Best viewed zoomed in on a digital copy.)
  • Figure 3: Multimodal Transformer Architecture for Multimodal MT. (Best viewed zoomed in on a digital copy.)
  • Figure 4: Example of the annotation process, taken from the Challenge Subset of Bengali VG. Note that the reference translation is wrong: the image indicates that "character" refers to the words on the banner, whereas the reference interprets "character" as a protagonist (of movies, stories, etc.).