Multilingual Pretraining for Pixel Language Models

Ilker Kesen; Jonas F. Lotz; Ingo Ziegler; Phillip Rust; Desmond Elliott

TL;DR

PIXEL-M4 presents the first multilingual pretraining of pixel-based language representations, covering four languages written in distinct scripts (English, Hindi, Ukrainian, Simplified Chinese). Using a masked autoencoding objective and equal amounts of data per language, it improves transfer to languages written in non-Latin scripts on text classification, dependency parsing, and NER, while maintaining performance on Latin-script languages. Word-level probing and hidden-representation analyses reveal richer linguistic features and a semantic space that is aligned across the pretraining languages, especially in deeper layers. The work provides evidence that tokenizer-free, cross-script representation learning is viable and points to scaling to larger model capacities and more languages as future directions.
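To make the two ingredients named above more concrete (equal data across languages and a masked autoencoding objective over image patches), here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names, the uniform random masking strategy, and the `mask_ratio` default are all hypothetical, and the actual model may select masked patches differently.

```python
import random

# The four pretraining languages named in the summary above.
LANGUAGES = ["English", "Hindi", "Ukrainian", "Simplified Chinese"]


def balanced_language_schedule(num_examples, languages=LANGUAGES, seed=0):
    """Return a shuffled list of language labels with equal counts per language.

    Hypothetical helper: it only illustrates the "equal data" idea.
    """
    if num_examples % len(languages) != 0:
        raise ValueError("num_examples must be divisible by the number of languages")
    per_language = num_examples // len(languages)
    schedule = [lang for lang in languages for _ in range(per_language)]
    random.Random(seed).shuffle(schedule)
    return schedule


def random_patch_mask(num_patches, mask_ratio=0.25, seed=0):
    """Return a list of booleans; True marks a patch hidden from the encoder.

    Assumption: patches are masked uniformly at random, and mask_ratio is an
    illustrative value rather than one taken from the paper.
    """
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(random.Random(seed).sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


if __name__ == "__main__":
    schedule = balanced_language_schedule(8)
    mask = random_patch_mask(16)
    print(schedule)
    print(mask)
```

In a masked autoencoding setup, the model would see only the unmasked patches of the rendered text image and be trained to reconstruct the pixels of the masked ones; the schedule above simply ensures that each language contributes the same number of training examples.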

Abstract

Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.
