Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation

Julian Spravil, Sebastian Houben, Sven Behnke

TL;DR

This work tackles zero‑shot multilingual image captioning under cross‑lingual data scarcity by leveraging a modular encoder–decoder VLM that combines Florence‑2 and Gemma‑2 in cross‑attention configurations. It introduces two scaling laws for the cross‑entropy loss $y$. The first relates loss to training compute: $y=\alpha_0 C^{\alpha_1}+\epsilon$, with $C=S F (1+P_t/P)$. The second relates loss to model size $P$, seen data $S$, and initial multilinguality $T$: $y=\beta_0 P^{\beta_1} S^{\beta_2} T^{\beta_3}+\epsilon$. Both are fitted on a synthetic continuous pre‑training dataset across six languages that uses CLIP‑based alignment to link images with translations. The results show strong fits for seen and unseen tasks, reveal that a language prefix can bootstrap zero‑shot captioning for unseen languages, and demonstrate that the scaling laws extend to downstream tasks after fine‑tuning (Multi30K, CoMMuTE, XM3600, COCO Karpathy). These findings guide practical decisions on model size, multilingual pre‑training, data collection, and task coverage when extending multilingual vision‑language capabilities. Overall, the framework provides predictable scaling behavior for cross‑task multilingual transfer in VLMs and informs strategies for deploying multilingual multimodal models in real‑world settings.

Abstract

Cross-lingual, cross-task transfer is challenged by task-specific data scarcity, which becomes more severe as language support grows and is further amplified in vision-language models (VLMs). We investigate multilingual generalization in encoder-decoder transformer VLMs to enable zero-shot image captioning in languages encountered only in the translation task. In this setting, the encoder must learn to generate generalizable, task-aware latent vision representations to instruct the decoder via inserted cross-attention layers. To analyze scaling behavior, we train Florence-2 based and Gemma-2 based models (0.4B to 11.2B parameters) on a synthetic dataset using varying compute budgets. While all languages in the dataset have image-aligned translations, only a subset of them include image captions. Notably, we show that captioning can emerge using a language prefix, even when this language only appears in the translation task. We find that indirect learning of unseen task-language pairs adheres to scaling laws that are governed by the multilinguality of the model, model size, and seen training samples. Finally, we demonstrate that the scaling laws extend to downstream tasks, achieving competitive performance through fine-tuning in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).
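
To make the second scaling law from the TL;DR concrete, the following is a minimal sketch of how a law of the form $y=\beta_0 P^{\beta_1} S^{\beta_2} T^{\beta_3}$ could be fitted. It is not the authors' fitting procedure: it assumes a least-squares fit in log space (which drops the additive $\epsilon$ term), and the coefficient values, the intermediate model sizes, and the helper names `fit_power_law` and `predict` are hypothetical. The data are synthetic and noise-free, generated only to show that the fit recovers the generating coefficients.

```python
import numpy as np

def predict(params, P, S, T):
    """Evaluate y = b0 * P**b1 * S**b2 * T**b3 (additive error term omitted)."""
    b0, b1, b2, b3 = params
    return b0 * P**b1 * S**b2 * T**b3

def fit_power_law(P, S, T, y):
    """Fit the multiplicative power law by ordinary least squares in log space.

    Assumption: errors are treated as multiplicative, so log(y) is linear in
    log(P), log(S), and log(T).
    """
    P, S, T, y = (np.asarray(a, dtype=float) for a in (P, S, T, y))
    X = np.column_stack([np.ones_like(P), np.log(P), np.log(S), np.log(T)])
    coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return np.exp(coef[0]), coef[1], coef[2], coef[3]

# Synthetic grid. 0.4B and 11.2B parameters, the seen-sample counts, and the
# two T values appear in the source; the intermediate sizes are hypothetical.
sizes = np.array([0.4e9, 1.0e9, 3.5e9, 11.2e9])
samples = np.array([0.5e6, 2e6, 5e6, 10e6])
initial_ce = np.array([10.44, 6.02])
P, S, T = (g.ravel() for g in np.meshgrid(sizes, samples, initial_ce))

true_params = (50.0, -0.08, -0.12, 0.5)  # hypothetical coefficients
y = predict(true_params, P, S, T)        # noise-free synthetic losses

fitted = fit_power_law(P, S, T, y)
print("fitted (b0, b1, b2, b3):", fitted)
```

With real measurements, the fitted exponents would indicate how strongly loss responds to each factor. Under this form, negative exponents on $P$ and $S$ together with a positive exponent on $T$ would correspond to loss falling as model size and seen samples grow and as the initial CE loss shrinks.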

Paper Structure

This paper contains 20 sections, 2 equations, 12 figures, 16 tables.

Figures (12)

  • Figure 1: We train a vision-language model (VLM; middle) on an incomplete dataset (left) that covers the tasks image captioning (blue) and multimodal machine translation (orange). While En$\rightarrow$X translation is available for all languages, captioning data is limited to only English and German. The VLM generalizes to the missing captioning-language pairs with sufficient scale (right).
  • Figure 2: Test cross-entropy (CE) loss for various training compute budgets (GMACs, Giga multiply-accumulate operations). We show results for the test splits for unseen captioning (UC) in Spanish (Es) and Chinese (Zh), seen translation (ST) in the same languages, and seen captioning (SC) in English (En) and German (De). All models are trained for 0.5M, 2M, 5M, and 10M seen samples. The compute scaling law ($y=\alpha_0 C^{\alpha_1}+\epsilon$) is fitted to the points on the Pareto frontier (gray staircase graph). Higher compute budgets improve CE loss for UC (left), ST (middle), and SC (right). This suggests that translation facilitates generalization in captioning.
  • Figure 3: Test CE loss as a function of model size ($P$), number of seen samples ($S$), and initial CE loss ($T$) across the three test splits: UC, ST, and SC. The dashed lines represent the fitted functions for three values of $T$: $T = 10.44$ for Florence-2 based models, $T = 6.02$ for Gemma-2 based models, and $T = 3.0$ for a hypothetical highly multilingual VLM. Line thickness is proportional to the $T$ value. The measured results for all evaluated models are shown as points. The 10M seen-sample line is highlighted in orange, while lower sample counts are represented by progressively lighter shades of gray. Notably, for the UC task, test CE loss decreases as $P$ and $S$ increase and $T$ decreases.
  • Figure 4: Effect of adding a prefix (Fr: "La photo montre", etc.) to the decoder input to unlock zero-shot captioning. Tested on the image captioning dataset XM3600 in the unseen languages Fr, Es, Ru, and Zh. The mean CIDEr over unseen languages significantly improves with the prefix.
  • Figure 5: Downstream task performance with respect to CE loss, measured on the UC, ST, and SC tasks, depending on the type of downstream task. First row: Multi30K translation to De and Fr measured in BLEU (Task 1; mean over Test2016, Test2017 and AmbiguousCOCO splits), CoMMuTE translation and disambiguation for En$\rightarrow$De and En$\rightarrow${De, Fr, Ru, Zh} measured in BLEU and accuracy, respectively. Second row: Captioning tasks measured with CIDEr: COCO Karpathy (En), Multi30K (En, De) (Task 2, Test2016), and XM3600 for En, De, and unseen languages (Fr, Es, Ru, Zh). We use a consistent y-axis scale for matching dataset and task.
  • ...and 7 more figures
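
The Figure 2 caption describes fitting the compute law $y=\alpha_0 C^{\alpha_1}$ to points on the Pareto frontier of compute versus test CE loss. A minimal sketch of that idea follows, under stated assumptions: the frontier is taken to be the set of runs whose loss is lower than that of every run with equal or smaller compute, the fit is a straight line in log-log space (ignoring the additive $\epsilon$), and all numbers and the helper names `pareto_frontier` and `fit_compute_law` are hypothetical rather than taken from the paper.

```python
import numpy as np

def pareto_frontier(compute, loss):
    """Return indices of runs that no cheaper-or-equal run matches or beats."""
    compute = np.asarray(compute, dtype=float)
    loss = np.asarray(loss, dtype=float)
    # Sort by compute; break ties by loss so the best run at a budget comes first.
    order = np.lexsort((loss, compute))
    best = np.inf
    keep = []
    for i in order:
        if loss[i] < best:  # strictly better than everything cheaper
            keep.append(int(i))
            best = loss[i]
    return keep

def fit_compute_law(compute, loss):
    """Fit y = a0 * C**a1 via a degree-1 polynomial fit in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(slope)

# Hypothetical runs: compute in GMACs, loss as test CE. Frontier runs follow
# 20 * C**-0.25 exactly; two runs are inflated so that they are dominated.
compute = np.array([1e3, 2e3, 2e3, 4e3, 8e3, 8e3])
inflation = np.array([1.0, 1.0, 1.3, 1.0, 1.2, 1.0])
loss = 20.0 * compute**-0.25 * inflation

frontier = pareto_frontier(compute, loss)
a0, a1 = fit_compute_law(compute[frontier], loss[frontier])
print("frontier indices:", frontier)
print("fitted a0, a1:", a0, a1)
```

Fitting only the frontier points means the curve describes the best loss reached at each compute budget, rather than an average over all configurations.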