PIXAR: Auto-Regressive Language Modeling in Pixel Space

Yintao Tai, Xiyang Liao, Alessandro Suglia, Antonio Vergari

TL;DR

PIXAR introduces a pixel-based autoregressive LLM that generates text by predicting image patches, eliminating the need for symbolic tokens. It employs a two-stage pretraining strategy: first a maximum-likelihood patch prediction, then a patch-wise context-aware adversarial loss to boost readability, balancing both objectives. The model achieves competitive results on GLUE compared to Pixel and GPT-2, and narrows the gap to GPT-2 on short generative tasks like LAMBADA and bAbI, while offering improved robustness to visual attacks. This work shows that perceptual input alone can support open-vocabulary text generation and prompts future exploration of multilingual and symbol-free language models that operate directly in pixel space.
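To make "generating text by predicting image patches" concrete, here is a minimal sketch of how a rendered line of text could be cut into a left-to-right patch sequence and turned into next-patch training pairs. The function names, the toy image, and the patch width are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of binary pixels (0 = background, 1 = ink)


def split_into_patches(image: Image, patch_width: int) -> List[Image]:
    """Cut a one-line rendered-text image into equal-width patches, left to right."""
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all rows must have the same width")
    if width % patch_width != 0:
        raise ValueError("image width must be a multiple of patch_width")
    return [
        [row[start:start + patch_width] for row in image]
        for start in range(0, width, patch_width)
    ]


def next_patch_pairs(patches: List[Image]) -> List[Tuple[List[Image], Image]]:
    """Build (context, target) pairs: the model sees patches[:i] and predicts patches[i]."""
    return [(patches[:i], patches[i]) for i in range(1, len(patches))]


# Toy 2x6 "image" standing in for a rendered line of text (assumption: binary pixels).
toy_image = [
    [1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1],
]
patches = split_into_patches(toy_image, patch_width=2)
pairs = next_patch_pairs(patches)
```

Each pair plays the role that a (prefix, next token) pair plays in a token-based model such as GPT-2, except that the target is a block of pixels rather than a vocabulary index.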

Abstract

Recent work showed the possibility of building open-vocabulary large language models (LLMs) that directly operate on pixel representations. These models are implemented as autoencoders that reconstruct masked patches of rendered text. However, these pixel-based LLMs are limited to discriminative tasks (e.g., classification) and, similar to BERT, cannot be used to generate text. Therefore, they cannot be used for generative tasks such as free-form question answering. In this work, we introduce PIXAR, the first pixel-based autoregressive LLM that performs text generation. Consisting of only a decoder, PIXAR can perform free-form generative tasks while keeping the number of parameters on par with previous encoder-decoder models. Furthermore, we highlight the challenges of generating text as non-noisy images and show this is due to using a maximum likelihood objective. To overcome this problem, we propose an adversarial pretraining stage that improves the readability and accuracy of PIXAR by 8.1 on LAMBADA and 8.5 on bAbI -- making it comparable to GPT-2 on text generation tasks. This paves the way to build open-vocabulary LLMs that operate on perceptual input only and calls into question the necessity of the usual symbolic input representation, i.e., text as (sub)tokens.
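The interplay between the maximum likelihood objective and the adversarial stage can be sketched as a weighted sum of two loss terms. This is an illustration under stated assumptions, not the paper's implementation: it assumes binary pixels scored with a per-pixel binary cross-entropy, a non-saturating generator loss, and a hypothetical weight `adv_weight`. None of these specifics is confirmed by the abstract.

```python
import math
from typing import Sequence


def pixel_bce(predicted: Sequence[float], target: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted ink probabilities and binary pixels."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    total = 0.0
    for p, t in zip(predicted, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(target)


def generator_adversarial_loss(disc_prob_real: float, eps: float = 1e-7) -> float:
    """Non-saturating generator loss: -log D(generated patch). Assumed form."""
    return -math.log(min(max(disc_prob_real, eps), 1.0))


def stage_two_loss(
    predicted: Sequence[float],
    target: Sequence[int],
    disc_prob_real: float,
    adv_weight: float,
) -> float:
    """Likelihood term plus a weighted adversarial term (adv_weight is hypothetical)."""
    return pixel_bce(predicted, target) + adv_weight * generator_adversarial_loss(disc_prob_real)


target_pixels = [1, 0, 1, 0]
blurry = [0.5, 0.5, 0.5, 0.5]  # hedged, "grey" prediction
sharp = [0.9, 0.1, 0.9, 0.1]   # confident, readable prediction
blurry_loss = pixel_bce(blurry, target_pixels)
sharp_loss = pixel_bce(sharp, target_pixels)
```

Setting `adv_weight` to 0 recovers the likelihood-only objective of the first stage. A positive weight adds a penalty whenever a discriminator judges the generated patch unrealistic, which is the kind of pressure toward crisp, readable glyphs that the abstract attributes to the adversarial stage.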

Paper Structure

This paper contains 29 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Pixar is the first generative language model operating on pixels only. Pixar accepts text as images and also generates text as image patches autoregressively, a challenging task.
  • Figure 2: A second stage of adversarial training can improve the readability of text generated by Pixar compared to the noisy patches generated with MLE only. The prompt is from the LAMBADA test set, rendered as a binary image.
  • Figure 3: Pixar can generate readable and correct texts according to the prompt (darker) on both bAbI (top) and LAMBADA (bottom). We folded images into rectangles for better visibility.
  • Figure 4: Pixar is more robust than GPT-2 under high visual attack ratios, especially on LAMBADA. We measured zero-shot accuracy on LAMBADA and few-shot accuracy on bAbI.
  • Figure 5: Pixar looks at longer patch sequences in the first layers and then focuses on specific word-like sequences, as shown by the above heatmaps of the attention weights for the first generated patch of "yadira". The same pattern is visible in many more examples in the paper's additional attention-layer figure.
  • ...and 7 more figures