Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation
Zhuang Yu, Shiliang Sun, Jing Zhao, Tengfei Song, Hao Yang
TL;DR
This work systematically analyzes how pre-trained encoders and decoders influence Multimodal Machine Translation (MMT) within a unified CLIP-based framework across English–German and English–French tasks on Multi30K and CoMMuTE. It finds a clear asymmetry: pre-trained decoders consistently improve generation quality, while pre-trained encoders are beneficial only when visual-text alignment is strong. The study also highlights memory reviving (fast early gains) and continuing learning (stabilized, robust improvement) as key dynamics in adapting large pre-trained models to multimodal data. The results offer practical guidance for designing MMT architectures and stress the importance of high-quality cross-modal grounding and alignment to fully exploit multimodal signals.
Abstract
Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study of the impact of pre-trained encoders and decoders in multimodal translation models. Specifically, we analyze how different training strategies, from training from scratch to using pre-trained and partially frozen components, affect translation performance under a unified MMT framework. Experiments are carried out on the Multi30K and CoMMuTE datasets across English–German and English–French translation tasks. Our results reveal that pre-training plays a crucial yet asymmetrical role in multimodal settings: pre-trained decoders consistently yield more fluent and accurate outputs, while the effect of pre-trained encoders varies with the quality of visual-text alignment. Furthermore, we provide insights into the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in multimodal translation systems.
