Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation

Andong Chen, Yuchen Song, Kehai Chen, Muyun Yang, Tiejun Zhao, Min Zhang

TL;DR

This paper addresses the bottleneck of relying on image annotations in multimodal machine translation by introducing IMAGE, a framework that uses a stable diffusion–based imagination module to generate sentence-specific images conditioned on source text. A reinforcement learning signal based on alignment between linguistic scene graphs (LSG) and visual scene graphs (VSG) enforces consistency without image-text annotations, enabling gains in both multimodal and text-only MT. Empirically, IMAGE substantially improves over text-only LLM MT and traditional MMT on Multi30K and WMT24 benchmarks, with notable gains in low-resource settings and a strong correlation between alignment rewards and translation quality. The approach demonstrates the potential of integrating imaginative visuals with multimodal LLMs to enhance translation accuracy and suggests scalable pathways for image-free supervision in MT.
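As an illustration of the alignment idea above, the following is a minimal sketch of how a consistency reward between a linguistic scene graph and a visual scene graph could be scored. It assumes each graph has already been parsed into a set of (subject, relation, object) triples and uses a plain F1 overlap as the reward. The `alignment_reward` function, the triple format, and the choice of F1 are hypothetical and are not taken from the paper, whose exact reward definition is not given in this summary.

```python
from typing import Iterable, Set, Tuple

# Assumption: a scene graph is represented as (subject, relation, object) triples.
Triple = Tuple[str, str, str]


def normalize(triples: Iterable[Triple]) -> Set[Triple]:
    """Lowercase and strip every element so matching ignores case and padding."""
    return {tuple(part.strip().lower() for part in triple) for triple in triples}


def alignment_reward(lsg: Iterable[Triple], vsg: Iterable[Triple]) -> float:
    """Hypothetical reward: F1 overlap between LSG and VSG triples, in [0, 1]."""
    lsg_set = normalize(lsg)
    vsg_set = normalize(vsg)
    if not lsg_set or not vsg_set:
        return 0.0
    matched = len(lsg_set & vsg_set)
    if matched == 0:
        return 0.0
    precision = matched / len(vsg_set)  # share of image triples grounded in the text
    recall = matched / len(lsg_set)     # share of text triples depicted in the image
    return 2 * precision * recall / (precision + recall)


# Toy example: the image shows the women on the street but misses the count.
lsg = [("women", "standing on", "street"), ("women", "count", "three")]
vsg = [("Women", "standing on", "street")]
reward = alignment_reward(lsg, vsg)
```

In this toy case the image covers one of the two text triples, so the reward falls below 1; an image whose scene graph matches the sentence exactly would score 1.0. A real system would need soft matching of synonyms and attributes rather than exact string equality.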

Abstract

Visual information has been introduced to enhance machine translation (MT), but its effectiveness relies heavily on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing multimodal MT. In particular, we build heuristic human feedback with reinforcement learning to ensure that the generated image is consistent with the source sentence without the supervision of image annotations, which removes the bottleneck of using visual information in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into large-scale text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT, achieving in particular an average improvement of more than 14 BLEU points on Multi30K multimodal MT benchmarks.

Paper Structure

This paper contains 29 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of the LLM translation paradigm based on visual information. In (a), the generated image does not include the information about “three women”; in (b), the generated image lacks the “standing” information. These omissions led to translation errors.
  • Figure 2: Overview of our IMAGE framework. The process first generates visual information for the input sentence using a diffusion model. The translation result is then obtained from the LLM, informed by the generated visual information and translation of the original input sentence.
  • Figure 3: RL training details. Overview of IMAGE, which leverages an alignment feedback learning framework to comprehensively enhance the performance of the visual signals.
  • Figure 4: Analysis of the experimental setup for assessing the impact of the Iterative Refinement part on translation performance.
  • Figure 5: Qualitative comparison of IMAGE against related work on the Multi30K En-De test set. In addition to generating high-quality images, IMAGE depicts the correct number of instances mentioned in the sentence and represents the scene more accurately overall. GPT-4O refers to using DALL-E for image generation, followed by the GPT-4O model performing translation based on the source sentence and the generated image. Words in red indicate translation errors.