Dual-branch Prompting for Multimodal Machine Translation
Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang
TL;DR
This work improves robustness in multimodal machine translation by removing the need for authentic images at inference and by mitigating visual noise. It introduces D2P-MMT, a diffusion-based dual-branch prompting framework that reconstructs images from the source text and learns jointly from authentic and reconstructed visuals through cross-modal coupling and a KL-consistency loss. On Multi30K, the approach achieves higher BLEU scores for En-De and En-Fr than image-free baselines and many image-dependent ones; it also generalizes well, and ablations support the effectiveness of the prompting strategy. Overall, the method strengthens cross-modal interaction and robustness, offering a practical MMT solution when reliable visual inputs are unavailable at deployment.
Abstract
Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, at inference D2P-MMT requires only the source text and a reconstructed image generated from it by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model learns jointly from authentic and reconstructed images using a dual-branch prompting strategy that encourages rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
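The distributional alignment loss described above can be illustrated with a minimal sketch, assuming it takes the form of a per-token KL divergence between the decoder output distributions of the two branches. The function names (`softmax`, `kl_divergence`, `consistency_loss`), the direction of the KL term (authentic branch as reference), and the averaging over target positions are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_authentic, logits_reconstructed):
    """Mean per-token KL between the two branches' output distributions.

    Each argument is a list of per-position logit vectors, one per target
    token. Assumption: the authentic-image branch is the reference
    distribution; the actual direction (or a symmetric variant) may differ.
    """
    total = 0.0
    for la, lr in zip(logits_authentic, logits_reconstructed):
        total += kl_divergence(softmax(la), softmax(lr))
    return total / len(logits_authentic)


# Toy example: two target positions over a three-word vocabulary.
authentic = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
reconstructed = [[1.5, 0.7, -0.8], [0.0, 0.4, 2.5]]
loss = consistency_loss(authentic, reconstructed)
```

The loss is zero when both branches produce identical distributions and grows as they diverge. Under this reading, minimizing it alongside the translation objective would pull the reconstructed-image branch, the only one available at inference, toward the behavior of the authentic-image branch.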
