GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

Jiafeng Xiong, Yuting Zhao

TL;DR

This work tackles the limitation of image-dependent inference in multimodal machine translation by introducing GIIFT, a two-stage framework that learns multimodal knowledge in a unified space using Multimodal Scene Graphs (MSG) and Linguistic Scene Graphs (LSG). A lightweight cross-modal Graph Attention Network (GAT) adapter is trained in Stage 1 on MSGs and then generalized to text-only domains via LSGs in Stage 2, enabling inductive image-free translation with a fixed backbone (mBART). GIIFT demonstrates state-of-the-art performance for image-free translation on Multi30K and notable improvements on WMT, while human evaluation confirms gains in completeness and fluency. The approach effectively embraces modality gaps and enables robust cross-domain generalization, offering practical benefits for scalable, image-free MT deployment in real-world settings.

Abstract

Multimodal Machine Translation (MMT) has shown that visual information can substantially help machine translation. However, existing MMT methods struggle to exploit the modality gap because they enforce rigid visual-linguistic alignment, and they are confined to inference within the multimodal domains they were trained on. In this work, we construct novel multimodal scene graphs that preserve and integrate modality-specific information, and we introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework. GIIFT uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and to generalize it inductively to broader image-free translation domains. Experimental results on the Multi30K English-to-French and English-to-German tasks show that GIIFT surpasses existing approaches and achieves state-of-the-art results, even without images at inference time. Results on the WMT benchmark show significant improvements over image-free translation baselines, demonstrating the strength of GIIFT for inductive image-free inference.
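
The cross-modal GAT adapter is named above but not specified in this summary. As a rough illustration of the graph attention mechanism such an adapter builds on, the following minimal sketch implements one single-head graph attention layer in plain Python over a toy scene graph. The function name `gat_layer`, the toy nodes, the feature values, and the weights are all assumptions made for illustration; this is not GIIFT's implementation.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU used to score node pairs before the softmax."""
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, edges, W, a, slope=0.2):
    """One single-head graph attention layer (standard GAT formulation).

    features: list of node feature vectors, one per node
    edges:    list of directed (src, dst) pairs; dst attends to src
    W:        shared linear map, shape out_dim x in_dim
    a:        attention vector of length 2 * out_dim
    Returns the attention-weighted sum of transformed neighbor features
    for every node (the final nonlinearity is omitted for brevity).
    """
    n = len(features)
    z = [matvec(W, h) for h in features]  # transformed node features

    # Every node attends to itself plus its incoming neighbors.
    neighbors = {i: {i} for i in range(n)}
    for src, dst in edges:
        neighbors[dst].add(src)

    out = []
    for i in range(n):
        nbrs = sorted(neighbors[i])
        # Unnormalized score: LeakyReLU(a . [z_i || z_j])
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])), slope)
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Aggregate neighbor features with the attention weights.
        out.append([
            sum(alpha * z[j][k] for alpha, j in zip(alphas, nbrs))
            for k in range(len(z[i]))
        ])
    return out


# Hypothetical toy scene graph: node 0 = object "man",
# node 1 = attribute "red", node 2 = relation "rides".
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_edges = [(1, 0), (0, 2)]            # attribute -> object, object -> relation
toy_W = [[1.0, 0.0], [0.0, 1.0]]        # identity map, for readability only
toy_a = [0.5, -0.5, 0.25, 0.25]         # arbitrary illustrative attention vector

fused = gat_layer(toy_features, toy_edges, toy_W, toy_a)
```

Each row of `fused` is a convex combination of that node's own transformed features and those of its incoming neighbors. In the framework described above, node representations of this kind would be produced from MSGs in Stage 1 and from LSGs in Stage 2; how they are injected into mBART is not covered by this sketch.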

Paper Structure

This paper contains 20 sections, 9 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: The inductive image-free generalization of GIIFT. GIIFT inductively learns from an entire multimodal domain, then enables image-free inference for MMT or text-only NMT via cross-modal generalization. In contrast, previous models can only learn from the limited overlap between image and text in red.
  • Figure 2: Representation of textual and visual information via MSG. Corresponding objects, attributes, and relations across both Textual Scene Graphs and Image Scene Graphs are depicted using identical colors.
  • Figure 3: Left: Overview of the two-stage GIIFT framework. Stage 1: multimodal learning via MSGs. Stage 2: cross-modal generalization via LSGs. Right: Overview of the architecture of the cross-modal GAT adapter, which inductively learns and fuses the multimodal knowledge for the backbone, mBART.
  • Figure 4: Under image-free inference, full GIIFT (image-free) is compared to GIIFT (w/o Stage 1) and mBART on the Multi30K validation set. The italicized, bracketed English translations of the German captions mark the differences in red.
  • Figure 5: Case study of GIIFT under image-free inference, compared to GIIFT (w/o Stage 1) and mBART. Data points are drawn from the Test2016 set of Multi30K. The gold sentence is the ground truth. The italicized sentence in brackets gives the English translation of the German text, and red words highlight the crucial translation differences.