Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation
Andong Chen, Yuchen Song, Kehai Chen, Muyun Yang, Tiejun Zhao, Min Zhang
TL;DR
This paper addresses the bottleneck of relying on image annotations in multimodal machine translation by introducing IMAGE, a framework that uses a stable diffusion–based imagination module to generate sentence-specific images conditioned on source text. A reinforcement learning signal based on alignment between linguistic scene graphs (LSG) and visual scene graphs (VSG) enforces consistency without image-text annotations, enabling gains in both multimodal and text-only MT. Empirically, IMAGE substantially improves over text-only LLM MT and traditional MMT on Multi30K and WMT24 benchmarks, with notable gains in low-resource settings and a strong correlation between alignment rewards and translation quality. The approach demonstrates the potential of integrating imaginative visuals with multimodal LLMs to enhance translation accuracy and suggests scalable pathways for image-free supervision in MT.
Abstract
Visual information has been introduced to enhance machine translation (MT), but its effectiveness relies heavily on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing multimodal MT. In particular, we build heuristic human feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence without the supervision of image annotations, which breaks the bottleneck of using visual information in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into large-scale text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT methods, achieving in particular an average improvement of more than 14 BLEU points on Multi30K multimodal MT benchmarks.
