Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages
Baban Gain, Dibyanayan Bandyopadhyay, Samrat Mukherjee, Chandranath Adak, Asif Ekbal
TL;DR
The study questions the added value of visual context in high-resource English-to-Indic MT by benchmarking strong unimodal baselines against two multimodal architectures on non-noisy and noisy source text. It finds that images are largely redundant once the model has large-scale unimodal pre-training, but provide modest benefits in noisy conditions, with cropped image features working best at low noise and full-image features at high noise. Through probing with random images, CLIP/ViT feature comparisons, and gate analyses, the paper argues that visual context mainly acts as a regularizer rather than supplying meaningful disambiguation. These findings highlight the need for datasets where the image is essential for correct translation and for mechanisms that truly leverage visual semantics in multimodal MT.
Abstract
Neural Machine Translation (NMT) has made remarkable progress using large-scale textual data, but the potential of incorporating multimodal inputs, especially visual information, remains underexplored in high-resource settings. While prior research has focused on using multimodal data in low-resource scenarios, this study examines how image features impact translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images might be redundant in this context. Additionally, the research introduces synthetic noise to assess whether images help the model handle textual noise. Multimodal models slightly outperform text-only models in noisy settings, even when random images are used. The study's experiments translate from English to Hindi, Bengali, and Malayalam, significantly outperforming state-of-the-art benchmarks. Interestingly, the effect of visual context varies with the level of source text noise: no visual context works best for non-noisy translations, cropped image features are optimal for low noise, and full image features perform better in high-noise scenarios. These results shed light on the role of visual context, especially in noisy settings, and open up a new research direction for Noisy Neural Machine Translation in multimodal setups. The research emphasizes the importance of combining visual and textual information to improve translation across various environments. The code is publicly available at https://github.com/babangain/indicMMT.
