Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models
Lucky Susanto, Musa Izzanardi Wijanarko, Khumaisa Nur'aini, Farid Adilazuarda, Alham Fikri Aji, Derry Tanti Wijaya
TL;DR
The paper investigates whether visual rendering lets language models bypass tokenization by evaluating DualGPT on four Indonesian scripts. It finds that adding a text tokenizer back into the architecture brings back tokenizer misalignment: a grapheme-based custom tokenizer delivers substantial gains over the Llama 2 tokenizer in monolingual settings, while cross-lingual transfer remains weak. The findings show that tokenizer alignment is a critical bottleneck even for pixel-based approaches, so multimodal models need careful tokenizer design and evaluation. Overall, the work cautions against assuming tokenization-free advantages in pixel-based architectures and calls for targeted architectural and data-centric strategies to improve script-level equity.
Abstract
While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Focusing on four low-resource Indonesian local languages that have their own non-Latin scripts (Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the very issue that pixel-based language modeling aims to resolve: tokenizer misalignment. Although the Llama 2 tokenizer has lower OOV and fertility rates, it performs significantly worse than a custom tokenizer, which improves results by up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
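The abstract contrasts tokenizers by their OOV and fertility rates. As a rough illustration of what such intrinsic metrics measure, the sketch below computes both for an arbitrary tokenizer function. It is a minimal sketch under stated assumptions, not the paper's evaluation code: fertility is assumed to mean tokens per whitespace-separated word, the OOV rate is assumed to mean the share of tokens mapped to an unknown symbol, and `make_char_tokenizer`, the `<unk>` symbol, and the toy vocabulary are hypothetical stand-ins.

```python
from typing import Callable, List, Set

Tokenizer = Callable[[str], List[str]]


def fertility(texts: List[str], tokenize: Tokenizer) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = sum(len(tokenize(text)) for text in texts)
    return n_tokens / n_words if n_words else 0.0


def oov_rate(texts: List[str], tokenize: Tokenizer, unk: str = "<unk>") -> float:
    """Fraction of produced tokens equal to the unknown symbol (assumed definition)."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    return sum(tok == unk for tok in tokens) / len(tokens) if tokens else 0.0


def make_char_tokenizer(vocab: Set[str], unk: str = "<unk>") -> Tokenizer:
    """Hypothetical toy tokenizer: one token per non-space character, unknowns become `unk`."""
    def tokenize(text: str) -> List[str]:
        return [ch if ch in vocab else unk for ch in text if not ch.isspace()]
    return tokenize


# Toy usage with a Latin-only vocabulary and placeholder strings.
toy_tokenize = make_char_tokenizer({"a", "b", "c", "d"})
toy_texts = ["ab cd", "xyz"]
print(fertility(toy_texts, toy_tokenize))
print(oov_rate(toy_texts, toy_tokenize))
```

Low values on both metrics describe how compactly a tokenizer covers the text; as the abstract reports, they do not by themselves predict downstream quality.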
