Detecting Concrete Visual Tokens for Multimodal Machine Translation
Braeden Bowen, Vipin Vijayan, Scott Grigsby, Timothy Anderson, Jeremy Gwinnup
TL;DR
This work addresses visual grounding in multimodal machine translation (MMT) by proposing three methods for detecting concrete tokens (NLTK-based concreteness detection, MDETR-based object detection, and a joint detection-verification approach) and four token-selection strategies for choosing which detected source tokens to mask. By synthesizing masked sentence-image datasets from Multi30k and training GRAM on them, the study demonstrates improved use of visual context, with CoMMuTE scores up to $0.67$ and BLEU scores up to $46.2$. A key finding is that NLTK-based detection often outperforms the more visually driven MDETR and joint methods, while deterministic token selection does not consistently beat random selection, suggesting that random masking remains competitive in this setting. The results offer practical guidance for designing token masking and selection strategies in MMT and highlight the complex relationship between detection rate, grounding quality, and translation metrics across datasets such as Multi30k and COCO.
Abstract
The challenge of visual grounding and masking in multimodal machine translation (MMT) systems has encouraged varying approaches to the detection and selection of visually grounded text tokens for masking. We introduce new methods for detecting visually and contextually relevant (concrete) tokens in source sentences, including detection with natural language processing (NLP), detection with object detection, and a joint detection-verification technique. We also introduce new methods for selecting among the detected tokens, including the shortest $n$ tokens, the longest $n$ tokens, and all detected concrete tokens. We use the GRAM MMT architecture to train models on synthetically collated multimodal datasets of source images paired with masked sentences, showing performance improvements and better use of visual context during translation over the baseline model.
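
The selection strategies named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `<mask>` symbol, the use of character count as the measure of token length, and the tie-breaking by position are all assumptions made for illustration. The detected indices are supplied by hand here, standing in for the output of any of the detection methods.

```python
import random
from typing import List, Optional, Sequence

MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch only


def select_tokens(
    tokens: Sequence[str],
    detected: Sequence[int],
    strategy: str,
    n: int = 1,
    seed: Optional[int] = None,
) -> List[int]:
    """Pick which detected token indices to mask.

    `detected` holds indices of tokens already flagged as concrete.
    Length is measured in characters (an assumption of this sketch);
    ties are broken by position in the sentence.
    """
    if strategy == "all":
        chosen = list(detected)
    elif strategy == "shortest":
        chosen = sorted(detected, key=lambda i: (len(tokens[i]), i))[:n]
    elif strategy == "longest":
        chosen = sorted(detected, key=lambda i: (-len(tokens[i]), i))[:n]
    elif strategy == "random":
        rng = random.Random(seed)
        chosen = rng.sample(list(detected), min(n, len(detected)))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(chosen)


def mask_sentence(tokens: Sequence[str], indices: Sequence[int]) -> str:
    """Replace the tokens at `indices` with the mask symbol."""
    to_mask = set(indices)
    return " ".join(MASK if i in to_mask else t for i, t in enumerate(tokens))


if __name__ == "__main__":
    tokens = "a dog chases a frisbee in the park".split()
    detected = [1, 4, 7]  # "dog", "frisbee", "park", flagged by hand
    for strategy in ("shortest", "longest", "all"):
        picked = select_tokens(tokens, detected, strategy, n=1)
        print(strategy, "->", mask_sentence(tokens, picked))
```

Keeping selection separate from detection means each detection method can be paired with each selection strategy without changing either function, which matches the way the configurations are described as combinations of the two.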
