Dual-branch Prompting for Multimodal Machine Translation

Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang

TL;DR

This work targets two weaknesses of multimodal machine translation (MMT): the need for authentic paired images at inference and sensitivity to irrelevant visual noise. It introduces D2P-MMT, a diffusion-based dual-branch prompting framework that reconstructs an image from the source text and, during training, learns jointly from authentic and reconstructed images through a cross-modal coupling function and a KL consistency loss. On Multi30K En-De and En-Fr, the authors report BLEU scores above those of image-free baselines and many image-dependent ones, with ablations supporting the prompting strategy and further results indicating good generalization. The method offers a practical MMT option when reliable visual inputs are unavailable at deployment.

Abstract

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
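
The distributional alignment loss described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the KL direction (authentic branch as the reference distribution), and the per-token averaging are all assumptions made for illustration, since the abstract only states that the output distributions of the two branches are encouraged to agree.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def kl_consistency_loss(logits_authentic, logits_reconstructed):
    """Mean per-token KL between the two branches' output distributions.

    Assumption: the authentic-image branch is the reference distribution,
    i.e. KL(p_authentic || p_reconstructed). The paper may use the other
    direction or a symmetric variant.
    """
    if not logits_authentic or len(logits_authentic) != len(logits_reconstructed):
        raise ValueError("both branches must supply the same non-zero number of tokens")
    total = 0.0
    for tok_a, tok_d in zip(logits_authentic, logits_reconstructed):
        total += kl_divergence(log_softmax(tok_a), log_softmax(tok_d))
    return total / len(logits_authentic)


if __name__ == "__main__":
    # Toy decoder logits: 2 target tokens, vocabulary of 3 (hypothetical values).
    authentic = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    reconstructed = [[1.5, 0.7, -0.8], [0.0, 0.4, 0.2]]
    print(kl_consistency_loss(authentic, reconstructed))
```

The loss is zero when both branches produce identical distributions and grows as they diverge. In a full training objective, a term like this would typically be added to the translation losses of the two branches with a weighting coefficient, which is again an assumption here.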

Paper Structure

This paper contains 36 sections, 25 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Illustration of a classic MMT model and the proposed D$^{2}$P-MMT framework. In the authentic image, the red bounding box highlights the main content of the sentence, while the yellow bounding box marks redundant information. In the proposed method, irrelevant visual information is filtered out by reconstructing the image.
  • Figure 2: The overall framework of the proposed D$^{2}$P-MMT model. It consists of four stages: image feature reconstruction, visual prompt generation, dual-branch prompting, and language translation. Images are reconstructed using pretrained diffusion models, and text prompts are adjusted based on visual prompts via a coupling function $\mathcal{F}(\cdot)$ to facilitate cross-modal interaction. The final translation output is derived from two input streams: the reconstructed fused representation $Z^{d}$ and the authentic fused representation $Z^{a}$.
  • Figure 3: The forward and reverse diffusion processes for an image.
  • Figure 4: Implementation of the visual multi-level prompt enhancement module.
  • Figure 5: Our inference process uses only the source sentence and the reconstructed image as input.
  • ...and 1 more figure