Translation-Enhanced Multilingual Text-to-Image Generation

Yaoyiran Li, Ching-Yun Chang, Stephen Rawls, Ivan Vulić, Anna Korhonen

TL;DR

This work tackles multilingual text-to-image generation (mTTI) by evaluating translation-based cross-lingual transfer and introducing a parameter-efficient Ensemble Adapter (EnsAd) that fuses machine translation (MT) outputs. It systematically compares Translate Train, Translate Test, and Zero-Shot Transfer, showing that zero-shot transfer often outperforms translation-based inference and that translation quality influences performance. The core contribution, EnsAd, aggregates multiple English translations via attention to bridge the language gap with only about 0.1% extra parameters, delivering consistent gains on COCO-CN, Multi30K Task2, and LAION-5B. The findings demonstrate the potential of translation-enhanced mTTI and offer practical guidance for building multilingual TTI systems with minimal additional capacity.

Abstract

Research on text-to-image generation (TTI) still predominantly focuses on the English language due to the lack of annotated image-caption data in other languages; in the long run, this might widen inequitable access to TTI technology. In this work, we thus investigate multilingual TTI (termed mTTI) and the current potential of neural machine translation (NMT) to bootstrap mTTI systems. We provide two key contributions. 1) Relying on a multilingual multi-modal encoder, we provide a systematic empirical study of standard methods used in cross-lingual NLP when applied to mTTI: Translate Train, Translate Test, and Zero-Shot Transfer. 2) We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework, mitigating the language gap and thus improving mTTI performance. Our evaluations on standard mTTI datasets COCO-CN, Multi30K Task2, and LAION-5B demonstrate the potential of translation-enhanced mTTI systems and also validate the benefits of the proposed EnsAd which derives consistent gains across all datasets. Further investigations on model variants, ablation studies, and qualitative analyses provide additional insights on the inner workings of the proposed mTTI approaches.
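
The idea of learning to "weigh and consolidate" several text representations can be illustrated with a minimal sketch. It assumes the source-language sentence embedding acts as an attention query over the embeddings of its English translations. This is an illustration only, not the authors' EnsAd architecture: the function names (`softmax`, `fuse_translations`), the single-head scaled dot-product scoring, and the toy two-dimensional vectors are all hypothetical, and a trained adapter would also include learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_translations(query, translations):
    """Attention-weighted average of translation embeddings.

    query:        embedding of the source-language sentence (assumed given)
    translations: embeddings of its English translations (assumed given)
    Returns the fused embedding and the attention weights.
    """
    dim = len(query)
    # Scaled dot-product score between the query and each translation.
    scores = [
        sum(q * t for q, t in zip(query, emb)) / math.sqrt(dim)
        for emb in translations
    ]
    weights = softmax(scores)
    # Weighted sum of the translation embeddings, one dimension at a time.
    fused = [
        sum(w * emb[i] for w, emb in zip(weights, translations))
        for i in range(dim)
    ]
    return fused, weights


# Toy data: two translations agree with the source embedding, one does not.
source_emb = [1.0, 0.0]
translation_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
fused, weights = fuse_translations(source_emb, translation_embs)
```

In this toy setup the two translations aligned with the source embedding receive equal and larger weights than the mismatched one, so the fused vector stays close to the source meaning.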

Paper Structure

This paper contains 23 sections, 6 equations, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: An overview of the full proposed mTTI framework with the Ensemble Adapter module. The black blocks are networks and contrastive learning (CL) losses already in the original LAFITE model (also in our pretrained mLAFITE). Our newly added modules and a CL loss are shown in red, gridded blocks.
  • Figure 2: TTI examples generated with Translate Test, Zero-Shot Transfer, and our best model. COCO-CN (zh) Test Set: rows $1$-$2$; Multi30K Task2 Test Set (de): rows $3$-$4$; LAION-5B (fi) Test Set: rows $5$-$6$. The resolution of the generated images is $256\times256$ pixels; ground-truth images are shown in their original sizes respectively.
  • Figure 3: Images generated with and without manually added information (COCO-CN Test set). The resolution of the generated images is $256\times256$ pixels; ground-truth images are shown in their original sizes respectively.