Multilingual Pretraining for Pixel Language Models

Ilker Kesen, Jonas F. Lotz, Ingo Ziegler, Phillip Rust, Desmond Elliott

TL;DR

PIXEL-M4 presents the first multilingual pretraining of pixel-based language representations, covering four languages written in four different scripts (English, Hindi, Ukrainian, Simplified Chinese). Using a masked autoencoding objective and equal amounts of data per language, it improves transfer to non-Latin-script languages in text classification, dependency parsing, and NER while maintaining performance on Latin-script languages. Word-level probing and hidden-representation analyses reveal richer linguistic features and a semantically aligned space across pretraining languages, especially in deeper layers. The work provides evidence that tokenizer-free, cross-script representation learning is viable and highlights directions for scaling to larger model capacities and more languages.

Abstract

Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.

Paper Structure

This paper contains 24 sections, 19 figures, 7 tables.

Figures (19)

  • Figure 1: Average performance across tasks comparing pixel-m4 and pixel-bigrams grouped by scripts: Arabic, Brahmic, Chinese-Japanese-Korean, Cyrillic, Latin, and others. Both models share the same architecture and hyperparameters, but pixel-m4 is pretrained in four visually and linguistically diverse languages: English, Hindi, Ukrainian and Simplified Chinese. pixel-m4 performs better in almost all non-Latin script languages without sacrificing Latin-script performance.
  • Figure 2: Data‐efficient learning experiments on the Naamapadam NER benchmark showing the mean test F$_1$ score as a function of training set size in log scale for four Brahmic languages. In each experiment, pixel-m4 consistently outperforms pixel-bigrams, with the largest relative gains under the smallest data regimes.
  • Figure 3: Word-level probing analysis on linspector, where each row investigates a different task and each column a different language. In each subplot, the y-axis shows model accuracy and the x-axis shows the layer whose hidden representations were probed. Multilingually pretrained pixel-m4 has learned better linguistic representations, even for languages with orthographically distant writing systems.
  • Figure 4: t-SNE visualization of the outputs for the specified layers. Each row contains visualizations for a particular model, and each column focuses on a particular layer. Each '$\bm{\times}$' marker appears at the centroid of a different pretraining language seen by pixel-m4. Both models cluster languages based on their scripts, yet pixel-m4 clusters some pretraining languages together in the later layers.
  • Figure 5: Cross-lingual similarity analysis on SIB-200 using the mean pooled hidden representations of pixel-m4. The x-axis indicates the layer number; the y-axis reports the performance using recall@5. Each line focuses on a different language-pair combination. The dashed line shows the maximum recall@5 value obtained by pixel-bigrams for these language pairs. This analysis reveals that pixel-m4 has learned a mutual semantic representation for some pretraining language pairs.
  • ...and 14 more figures
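
The cross-lingual similarity analysis in Figure 5 (mean-pooled hidden representations scored with recall@5) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `mean_pool` and `recall_at_k`, the use of cosine similarity, and the assumption that row i of the source and target matrices holds the same sentence in two languages are all assumptions made for this example.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average hidden states over non-padding positions.

    hidden: array of shape (n_sentences, seq_len, dim)
    mask:   array of shape (n_sentences, seq_len), 1 for real patches, 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)
    # Clip the denominator so an all-padding row does not divide by zero.
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1, None)


def recall_at_k(src, tgt, k=5):
    """Fraction of source sentences whose aligned target is in the top-k neighbours.

    Assumes src[i] and tgt[i] are embeddings of the same sentence in two
    languages, and ranks targets by cosine similarity (an assumption here).
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                         # (n_src, n_tgt) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]    # indices of the k most similar targets
    hits = (topk == np.arange(len(src))[:, None]).any(axis=1)
    return float(hits.mean())
```

To reproduce the shape of the plot, one would call `mean_pool` on the hidden states of each layer for two languages and then `recall_at_k` on the resulting pair of matrices, giving one recall@5 value per layer and language pair.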