Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation

Zhuang Yu, Shiliang Sun, Jing Zhao, Tengfei Song, Hao Yang

TL;DR

This work systematically analyzes how pre-trained encoders and decoders influence Multimodal Machine Translation (MMT) within a unified CLIP-based framework across English–German and English–French tasks on Multi30K and CoMMuTE. It finds a clear asymmetry: pre-trained decoders consistently improve generation quality, while pre-trained encoders are beneficial only when visual-text alignment is strong. The study also highlights memory reviving (fast early gains) and continuing learning (stabilized, robust improvement) as key dynamics in adapting large pre-trained models to multimodal data. The results offer practical guidance for designing MMT architectures and stress the importance of high-quality cross-modal grounding and alignment to fully exploit multimodal signals.

Abstract

Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study on the impact of pre-trained encoders and decoders in multimodal translation models. Specifically, we analyze how different training strategies, from training from scratch to using pre-trained and partially frozen components, affect translation performance under a unified MMT framework. Experiments are carried out on the Multi30K and CoMMuTE datasets across English-German and English-French translation tasks. Our results reveal that pre-training plays a crucial yet asymmetrical role in multimodal settings: pre-trained decoders consistently yield more fluent and accurate outputs, while pre-trained encoders show varied effects depending on the quality of visual-text alignment. Furthermore, we provide insights into the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in multimodal translation systems.

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Our unified MMT baseline. For the visual encoder, we use CLIP to extract image features. For the text encoder and decoder, we explore the impact of pre-trained components on the entire baseline.
  • Figure 2: The trend of BLEU and METEOR scores of the pre-trained model from scratch over training epochs on Multi30K.
  • Figure 3: Evaluation of pre-trained encoders and decoders in the En-De direction on the Multi30K and CoMMuTE datasets, comparing the performance of different models in two scenarios: normal alignment and shuffled alignment.
  • Figure 4: Case study on the CoMMuTE dataset in the English-to-German translation direction.
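
The "shuffled alignment" scenario in Figure 3 can be illustrated with a small sketch. The summary above does not say how the shuffling is implemented, so the following is only one plausible reading: every source sentence is re-paired with an image taken from a different example, so that the visual input no longer matches the text. The function name `shuffle_alignment`, the `(image, sentence)` pair layout, and the `seed` parameter are all hypothetical and chosen for illustration.

```python
import random
from typing import List, Tuple


def shuffle_alignment(
    pairs: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Re-pair each sentence with an image from a different example.

    `pairs` holds (image, sentence) tuples. This is an assumed procedure,
    not the one used in the paper.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two examples to break the alignment")

    rng = random.Random(seed)
    # Arrange the example indices in a random order, then treat that order
    # as one cycle: each example receives the image of its successor.
    # A single cycle of length >= 2 has no fixed points, so no sentence
    # keeps its original image.
    order = list(range(n))
    rng.shuffle(order)

    shuffled: List[Tuple[str, str]] = list(pairs)
    for pos, idx in enumerate(order):
        donor = order[(pos + 1) % n]
        shuffled[idx] = (pairs[donor][0], pairs[idx][1])
    return shuffled


if __name__ == "__main__":
    data = [(f"img{i}.jpg", f"sentence {i}") for i in range(4)]
    for image, sentence in shuffle_alignment(data, seed=1):
        print(image, "<-", sentence)
```

Under this reading, a model that genuinely uses the image should score lower on the shuffled pairs than on the normally aligned ones, while a model that ignores the image should score about the same in both scenarios.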