Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation
Julian Spravil, Sebastian Houben, Sven Behnke
TL;DR
This work tackles zero-shot multilingual image captioning under cross-lingual data scarcity using a modular encoder-decoder VLM that combines Florence-2 and Gemma-2 in cross-attention configurations. It introduces two scaling laws for the cross-entropy loss $y$: one in terms of training compute, $y=\alpha_0 C^{\alpha_1}+\epsilon$ with $C=S F (1+P_t/P)$, and one in terms of model size, seen data, and initial multilinguality, $y=\beta_0 P^{\beta_1} S^{\beta_2} T^{\beta_3}+\epsilon$. The models are trained on a synthetic continuous pre-training dataset covering six languages, with CLIP-based alignment linking images to translations. The scaling laws fit both seen and unseen tasks well, a language prefix can bootstrap zero-shot captioning in languages seen only in translation, and the laws extend to downstream tasks after fine-tuning (Multi30K, CoMMuTE, XM3600, COCO Karpathy). These findings inform practical decisions on model size, multilingual pre-training, data collection, and task coverage when extending the multilingual capabilities of VLMs.
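As a rough illustration of how power laws of the form given in the TL;DR can be fitted, the sketch below estimates the coefficients by ordinary least squares in log space. This is an assumption made for illustration, not the authors' fitting procedure: a log-space fit treats the noise as multiplicative, whereas the formulas write $\epsilon$ as additive. The function names are hypothetical, and since $F$ and $P_t$ are not defined in this excerpt, the code treats them as opaque positive numbers.

```python
import numpy as np


def compute_budget(S, F, P_t, P):
    """C = S * F * (1 + P_t / P), exactly as written in the TL;DR.

    F and P_t are not defined in this excerpt; they are passed through as-is.
    """
    return S * F * (1.0 + P_t / P)


def fit_compute_law(C, y):
    """Fit y ~ alpha0 * C**alpha1 via a straight-line fit of log(y) on log(C).

    Assumption: noise is ignored or multiplicative, so the log transform is valid.
    Returns (alpha0, alpha1).
    """
    C = np.asarray(C, dtype=float)
    y = np.asarray(y, dtype=float)
    alpha1, log_alpha0 = np.polyfit(np.log(C), np.log(y), 1)
    return float(np.exp(log_alpha0)), float(alpha1)


def fit_joint_law(P, S, T, y):
    """Fit y ~ beta0 * P**beta1 * S**beta2 * T**beta3 by least squares in log space.

    P, S, T, and y must be positive and of equal length.
    Returns (beta0, beta1, beta2, beta3).
    """
    P, S, T, y = (np.asarray(a, dtype=float) for a in (P, S, T, y))
    # Design matrix: intercept column plus one log-feature per variable.
    X = np.column_stack([np.ones_like(P), np.log(P), np.log(S), np.log(T)])
    coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return float(np.exp(coef[0])), float(coef[1]), float(coef[2]), float(coef[3])
```

For measurements with genuinely additive noise, a nonlinear least-squares fit on the untransformed loss would be the closer match to the stated formulas; the log-space version is shown only because it is short and has a closed-form solution.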
Abstract
Cross-lingual, cross-task transfer is challenged by task-specific data scarcity, which becomes more severe as language support grows and is further amplified in vision-language models (VLMs). We investigate multilingual generalization in encoder-decoder transformer VLMs to enable zero-shot image captioning in languages encountered only in the translation task. In this setting, the encoder must learn to generate generalizable, task-aware latent vision representations to instruct the decoder via inserted cross-attention layers. To analyze scaling behavior, we train Florence-2 based and Gemma-2 based models (0.4B to 11.2B parameters) on a synthetic dataset using varying compute budgets. While all languages in the dataset have image-aligned translations, only a subset of them include image captions. Notably, we show that captioning can emerge using a language prefix, even when this language only appears in the translation task. We find that indirect learning of unseen task-language pairs adheres to scaling laws that are governed by the multilinguality of the model, model size, and seen training samples. Finally, we demonstrate that the scaling laws extend to downstream tasks, achieving competitive performance through fine-tuning in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).
