Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models

Lucky Susanto, Musa Izzanardi Wijanarko, Khumaisa Nur'aini, Farid Adilazuarda, Alham Fikri Aji, Derry Tanti Wijaya

TL;DR

The paper investigates whether visual rendering can bypass tokenization in language models by evaluating DualGPT on four Indonesian scripts. It finds that adding a text tokenizer back into the architecture brings tokenization misalignment back with it: in monolingual settings, a grapheme-based custom tokenizer delivers substantial gains over the Llama 2 tokenizer, while cross-lingual transfer remains weak. The findings indicate that tokenizer alignment is a critical bottleneck even for pixel-based approaches, so tokenizer design and evaluation need careful attention in multimodal models. Overall, the work cautions against assuming that pixel-based architectures are tokenization-free and calls for targeted architectural and data-centric strategies to improve script-level equity.

Abstract

While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Focusing on four low-resource Indonesian local languages that have their own non-Latin scripts (i.e., Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the very issue that pixel-based language modeling aims to resolve: tokenizer misalignment. We show that the Llama 2 tokenizer, despite having lower OOV and fertility rates, performs significantly worse than a custom tokenizer, which yields improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
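
The abstract compares tokenizers by OOV and fertility rates without defining them. The sketch below shows one common way such metrics are computed: fertility as tokens per whitespace-delimited word, and OOV rate as the share of tokens mapped to an unknown symbol. These definitions, the whitespace word split, the `<unk>` symbol, and the toy character tokenizer are all assumptions made for illustration; the paper's exact definitions may differ.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]


def fertility(sentences: Iterable[str], tokenize: Tokenizer) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_words += len(sentence.split())
        n_tokens += len(tokenize(sentence))
    return n_tokens / n_words if n_words else 0.0


def oov_rate(sentences: Iterable[str], tokenize: Tokenizer, unk: str = "<unk>") -> float:
    """Fraction of produced tokens equal to the unknown symbol (assumed definition)."""
    n_tokens = 0
    n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_tokens += len(tokens)
        n_unk += sum(1 for t in tokens if t == unk)
    return n_unk / n_tokens if n_tokens else 0.0


def make_toy_char_tokenizer(vocab: set, unk: str = "<unk>") -> Tokenizer:
    """Hypothetical stand-in tokenizer: one token per non-space character."""
    def tokenize(text: str) -> List[str]:
        return [ch if ch in vocab else unk for ch in text if not ch.isspace()]
    return tokenize


if __name__ == "__main__":
    toy = make_toy_char_tokenizer({"a", "b", "c", "d"})
    corpus = ["ab cd", "ae"]
    print("fertility:", fertility(corpus, toy))
    print("OOV rate:", oov_rate(corpus, toy))
```

Both numbers describe only how a tokenizer segments text, so a tokenizer can score well on them and still be poorly aligned with a script, which is the gap the abstract points to.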

Paper Structure (29 sections, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Image transliteration performance of general VLMs, averaged across Javanese, Balinese, Sundanese, and Lampung on the NusaAksara evaluation dataset. Llama-3.1-NN-VL-8B-V1 is the Nemotron-Nano variant.
  • Figure 2: Fertility rate on the evaluation dataset per language. Bali$^\dagger$ and Lampung$^\ddagger$ use the Javanese and Sundanese tokenizers, respectively.
  • Figure 3: Impact of tokenizer choice on Monolingual DualGPT pretraining and finetuning: Pretraining $\rightarrow$ Finetuning $\rightarrow$ Evaluation.
  • Figure 4: Dataset building process using custom tokenization rules.