Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages

Baban Gain; Dibyanayan Bandyopadhyay; Samrat Mukherjee; Chandranath Adak; Asif Ekbal

TL;DR

The study questions the added value of visual context in high-resource English-to-Indic MT by benchmarking strong unimodal baselines against two multimodal architectures on both non-noisy and noisy source text. It finds that images are largely redundant once large-scale unimodal pre-training is in place, but that they provide modest benefits in noisy conditions: cropped image features work best under low noise, and full-image features under high noise. Through probing with random images, CLIP/ViT feature comparisons, and gate analyses, the paper argues that visual context mainly acts as a regularizer rather than supplying meaningful disambiguation. These findings highlight the need for datasets where the image is essential for correct translation and for mechanisms that truly leverage visual semantics in multimodal MT.

Abstract

Neural Machine Translation (NMT) has made remarkable progress using large-scale textual data, but the potential of incorporating multimodal inputs, especially visual information, remains underexplored in high-resource settings. While prior research has focused on using multimodal data in low-resource scenarios, this study examines how image features impact translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images might be redundant in this context. Additionally, the research introduces synthetic noise to assess whether images help the model handle textual noise. Multimodal models slightly outperform text-only models in noisy settings, even when random images are used. The study's experiments translate from English to Hindi, Bengali, and Malayalam, significantly outperforming state-of-the-art benchmarks. Interestingly, the effect of visual context varies with the level of source text noise: no visual context works best for non-noisy translations, cropped image features are optimal for low noise, and full image features perform better in high-noise scenarios. This sheds light on the role of visual context, especially in noisy settings, and opens up a new research direction for Noisy Neural Machine Translation in multimodal setups. The research emphasizes the importance of combining visual and textual information to improve translation across various environments. Our code is publicly available at https://github.com/babangain/indicMMT.
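The abstract does not spell out how the synthetic noise is produced, so the following is only an illustrative sketch of what injecting low and high levels of source-side noise could look like. It assumes simple character-level perturbations (adjacent swap, deletion, substitution) applied independently to each word with a given probability. The function names and the two rates are hypothetical and are not taken from the paper or its repository.

```python
import random
import string

# Hypothetical noise rates; the paper's actual settings are not given in the abstract.
LOW_NOISE_RATE = 0.1
HIGH_NOISE_RATE = 0.5


def perturb_word(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit (swap, delete, or substitute)."""
    if len(word) < 2:
        return word  # too short to perturb without risking an empty token
    op = rng.choice(["swap", "delete", "substitute"])
    i = rng.randrange(len(word) - 1)
    if op == "swap":
        # Swap the characters at positions i and i + 1.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        # Drop the character at position i.
        return word[:i] + word[i + 1:]
    # Replace the character at position i with a random lowercase letter.
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]


def add_noise(sentence: str, rate: float, seed: int = 0) -> str:
    """Perturb each whitespace-separated word independently with probability `rate`."""
    rng = random.Random(seed)
    words = sentence.split()
    noisy = [perturb_word(w, rng) if rng.random() < rate else w for w in words]
    return " ".join(noisy)


if __name__ == "__main__":
    source = "a man is riding a red bicycle down the street"
    print(add_noise(source, LOW_NOISE_RATE))
    print(add_noise(source, HIGH_NOISE_RATE))
```

Under these assumptions, the same English source sentence can be corrupted at either rate and then passed to both the text-only and the multimodal model, so that any difference in translation quality can be attributed to the visual input rather than to the noise itself.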

Paper Structure (32 sections, 4 equations, 4 figures, 13 tables)

Introduction
Related Works
Multimodal Translation
Multimodal Translation on Indian Languages
Context-Aware Translation
Noisy Neural Machine Translation
Dataset Details
Noisy Data Generation
Low Noise
High Noise
Pre-Processing
Methodology
Baseline
Unimodal Fine-tuning
Multimodal Fine-tuning with Selective Attention
...and 17 more sections

Figures (4)

Figure 1: Example of combined Hindi, Bengali, and Malayalam dataset
Figure 2: Selective Attention Architecture for Multimodal MT. (Best viewed zoomed in on a digital copy.)
Figure 3: Multimodal Transformer Architecture for Multimodal MT. (Best viewed zoomed in on a digital copy.)
Figure 4: Example of the annotation process. This example is taken from the Challenge Subset of Bengali VG. Note that the reference translation is wrong: the image shows that "character" refers to the words on the banner, whereas the reference translates "character" in the sense of a protagonist (of movies, stories, etc.).
