Detecting Concrete Visual Tokens for Multimodal Machine Translation

Braeden Bowen; Vipin Vijayan; Scott Grigsby; Timothy Anderson; Jeremy Gwinnup

TL;DR

This work tackles the problem of visual grounding in multimodal machine translation by proposing three concrete-token detection methods (NLTK-based concreteness, MDETR-based object detection, and a joint verification approach) and four token-selection strategies to mask source tokens. By synthesizing masked sentence-image datasets from Multi30k and training GRAM with these configurations, the study demonstrates improved utilization of visual context, achieving CoMMuTE scores up to $0.67$ and BLEU scores up to $46.2$ in some configurations. A key finding is that the NLTK-based detection often outperforms the more visually driven MDETR and Joint methods, while deterministic token selection does not consistently beat random selection, suggesting that random masking can still be competitive in this setting. The results provide practical guidance for designing token masking and selection strategies in MMT and highlight the complex relationship between detection rate, grounding quality, and translation metrics across datasets like Multi30k and COCO.

Abstract

The challenge of visual grounding and masking in multimodal machine translation (MMT) systems has encouraged varying approaches to the detection and selection of visually grounded text tokens for masking. We introduce new methods for detecting visually and contextually relevant (concrete) tokens in source sentences, including detection with natural language processing (NLP), detection with object detection, and a joint detection-verification technique. We also introduce new methods for selecting among the detected tokens, including the shortest $n$ tokens, the longest $n$ tokens, and all detected concrete tokens. We use the GRAM MMT architecture to train models on synthetically collated multimodal datasets of source images paired with masked sentences, showing improved translation performance and better use of visual context over the baseline model.
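The selection strategies named in the abstract (shortest $n$, longest $n$, all detected tokens), together with the random selection mentioned in the TL;DR, can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: token length is measured in characters, the `<mask>` placeholder is hypothetical, and the detector output is hard-coded instead of coming from NLTK or MDETR.

```python
import random

MASK = "<mask>"  # hypothetical placeholder; the paper's actual mask token is not stated here


def select_tokens(concrete, strategy, n=1, seed=0):
    """Choose which detected concrete tokens to mask.

    concrete: list of (position, token) pairs from some detector (assumed format).
    strategy: "shortest", "longest", "all", or "random".
    """
    if strategy == "all":
        return list(concrete)
    if strategy == "shortest":
        # sorted() is stable, so ties keep their order in the sentence
        return sorted(concrete, key=lambda pt: len(pt[1]))[:n]
    if strategy == "longest":
        return sorted(concrete, key=lambda pt: -len(pt[1]))[:n]
    if strategy == "random":
        rng = random.Random(seed)
        return rng.sample(concrete, min(n, len(concrete)))
    raise ValueError(f"unknown strategy: {strategy}")


def mask_sentence(tokens, selected):
    """Replace the selected positions with the mask placeholder."""
    positions = {pos for pos, _ in selected}
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "a man cooks peppers on a grill".split()
# Assumed detector output: positions and surface forms of the concrete tokens.
concrete = [(1, "man"), (3, "peppers"), (6, "grill")]

for strategy in ("shortest", "longest", "all", "random"):
    chosen = select_tokens(concrete, strategy, n=1)
    print(strategy, " ".join(mask_sentence(tokens, chosen)))
```

Keeping detection and selection as separate steps mirrors the way the abstract describes them, so any of the three detection methods could feed any selection strategy.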

Paper Structure (19 sections, 4 figures, 3 tables)


Introduction
Related Works
Masking for Visual Grounding
Token Selection for Visual Grounding
Approach
Detection of Concrete Tokens
Detection with NLTK
Detection with MDETR
Detection with Joint Visual Grounding
Synthetic Dataset Collation
Token Selection Techniques
GRAM Model
Results and Discussion
Experimental Framework
Results
...and 4 more sections

Figures (4)

Figure 1: Multi30k source pairs (image, SRC) with results from each detection technique (DT) and an example masked source text (MSK). DT1 represents the NLTK technique; DT2 represents the MDETR Detection technique; DT3 represents the Joint Detection technique. The masked sentence MSK represents a possible masked sentence based on the bold token in the DT3 detections.
Figure 2: An example hypernym graph. The original token, sedan, its three synset entries (labeled in blue), and its associated concrete hypernyms (labeled in red).
Figure 3: Multi30k source pair (image, SRC) with results from the MDETR (DT2, top image) and Joint (DT3, bottom image) detection techniques. MDETR query strings, bounding boxes, and confidence scores are shown. In this example, supplying the entire source sentence as text input to the MDETR object detection model incorrectly identifies the peppers being cooked, while querying only the word "pepper" more closely identifies the region containing the query.
Figure 4: GRAM model architecture, from Vijayan et al. (2024).
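The hypernym graph described in the Figure 2 caption suggests one way a concreteness check could work: start from each synset of a token and walk upward through its hypernyms, collecting any that count as concrete. The sketch below illustrates that idea on a tiny hand-written graph. It does not call NLTK, and the node names, the `CONCRETE` set, and the rule "a token is concrete if any synset reaches a concrete hypernym" are all assumptions made for illustration, not the paper's actual data or criterion.

```python
from collections import deque

# Toy hypernym edges (node -> more general nodes), hand-written for illustration.
HYPERNYMS = {
    "sedan#1": ["car"],
    "sedan#2": ["litter"],
    "car": ["vehicle"],
    "litter": ["vehicle"],
    "vehicle": ["artifact"],
    "artifact": ["physical object"],
    "idea#1": ["thought"],
    "thought": ["abstraction"],
}

# Toy synset lookup (token -> its senses); a real system would query a lexical database.
SYNSETS = {"sedan": ["sedan#1", "sedan#2"], "idea": ["idea#1"]}

# Assumed set of hypernyms treated as concrete.
CONCRETE = {"vehicle", "artifact", "physical object"}


def concrete_hypernyms(token):
    """Return every concrete hypernym reachable from any sense of the token."""
    found, seen = set(), set()
    queue = deque(SYNSETS.get(token.lower(), []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node in CONCRETE:
            found.add(node)
        queue.extend(HYPERNYMS.get(node, []))
    return found


def is_concrete(token):
    """Assumed rule: a token is concrete if any sense has a concrete hypernym."""
    return bool(concrete_hypernyms(token))


print(is_concrete("sedan"), sorted(concrete_hypernyms("sedan")))
print(is_concrete("idea"))
```

A breadth-first walk with a `seen` set visits shared ancestors such as `vehicle` only once, which matters in real hypernym graphs where several senses converge on the same general concept.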
