Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models

Lucky Susanto; Musa Izzanardi Wijanarko; Khumaisa Nur'aini; Farid Adilazuarda; Alham Fikri Aji; Derry Tanti Wijaya

TL;DR

The paper investigates whether visual rendering can bypass tokenization in language models by evaluating DualGPT on four Indonesian scripts. It finds that adding a text tokenizer back into a pixel-based model brings back tokenizer misalignment: a grapheme-based custom tokenizer delivers substantial gains over the Llama 2 tokenizer in monolingual settings, while cross-lingual transfer remains weak. Tokenizer alignment is therefore a critical bottleneck even for pixel-based approaches, and multimodal models need careful tokenizer design and evaluation. Overall, the work cautions against assuming tokenization-free advantages in pixel-based architectures and calls for targeted architectural and data-centric strategies to improve script-level equity.

Abstract

While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Focusing on four low-resource local languages of Indonesia that have their own non-Latin scripts (Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the very issue that pixel-based language modeling aims to resolve: tokenizer misalignment. Although the Llama 2 tokenizer has lower OOV and fertility rates, it performs significantly worse than a custom tokenizer, which yields improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
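The two tokenizer statistics named in the abstract can be illustrated with a small sketch. It assumes the common definitions, which the abstract itself does not spell out: fertility as the average number of tokens per whitespace-separated word, and OOV rate as the share of tokens mapped to an unknown-token symbol. The helper names, the `<unk>` symbol, and the toy character tokenizer are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]


def fertility(texts: Iterable[str], tokenize: Tokenizer) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    return n_tokens / n_words if n_words else 0.0


def oov_rate(texts: Iterable[str], tokenize: Tokenizer, unk_token: str = "<unk>") -> float:
    """Fraction of produced tokens equal to the unknown-token symbol (assumed definition)."""
    total = 0
    unknown = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        unknown += sum(1 for tok in tokens if tok == unk_token)
    return unknown / total if total else 0.0


def toy_char_tokenizer(text: str) -> List[str]:
    """Hypothetical character-level tokenizer with a three-symbol vocabulary."""
    vocab = {"a", "b", "c"}
    # Whitespace is dropped; any character outside the vocabulary becomes <unk>.
    return [ch if ch in vocab else "<unk>" for ch in text if not ch.isspace()]


if __name__ == "__main__":
    sample = ["ab cd"]
    print("fertility:", fertility(sample, toy_char_tokenizer))
    print("OOV rate:", oov_rate(sample, toy_char_tokenizer))
```

Under these definitions, low values on both statistics describe segmentation only, so a tokenizer can score well on both and still perform poorly downstream, which is the pattern the abstract reports for the Llama 2 tokenizer.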

Paper Structure (29 sections, 4 figures, 9 tables)

Introduction
Background and Motivation
Tokenization and Cross-lingual Disparities.
Pixel-based Language Modeling.
Multimodal Variants.
Methodology
Datasets
PixelGPT
Tokenizer and Renderer
Evaluation
Model Setup
Tokenizer Statistics
Results
VLM Evaluation
DualGPT Architecture
...and 14 more sections

Figures (4)

Figure 1: Image transliteration performance of general VLMs, averaged across Javanese, Balinese, Sundanese, and Lampung on the NusaAksara evaluation dataset. Llama-3.1-NN-VL-8B-V1 is the Nemotron-Nano variant.
Figure 2: Fertility rate on the evaluation dataset per language. Bali$^\dagger$ and Lampung$^\ddagger$ use the Javanese and Sundanese tokenizers, respectively.
Figure 3: Impact of tokenizer choice on Monolingual DualGPT pretraining and finetuning: Pretraining $\rightarrow$ Finetuning $\rightarrow$ Evaluation.
Figure 4: Dataset building process using custom tokenization rules.
