PixelWorld: How Far Are We from Perceiving Everything as Pixels?

Zhiheng Lyu, Xueguang Ma, Wenhu Chen

TL;DR

PixelWorld investigates unified perception by representing all inputs as pixels through the PEAP framework. It introduces the PixelWorld benchmark, which converts text, tables, code, and diagrams into a shared pixel space and evaluates vision–language models across task types and model scales. PEAP matches token-based methods on semantic understanding but struggles on reasoning tasks such as mathematics and programming, where Chain-of-Thought prompting recovers part of the gap; PEAP-Fast additionally offers substantial speedups. Overall, the work demonstrates both the potential and the limitations of pixel-based multimodal learning and provides a practical framework for diagnosing and advancing unified vision–language representations.

Abstract

Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather than separately processed images and tokenized text. This shift highlights the need for a unified perception paradigm. To investigate this idea, we explore Perceive Everything as Pixels (PEAP) and introduce PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space. Experiments across multiple benchmarks show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, suggesting that vision transformers can partially capture global textual semantics without explicit tokenization. In contrast, reasoning-intensive tasks such as mathematics and code show notable performance degradation, although Chain-of-Thought prompting helps mitigate this gap by compensating for missing symbolic structure. We further find that when visual and textual information are closely integrated, representing everything as pixels simplifies preprocessing and avoids cross-modal misalignment. PixelWorld thus provides a systematic and practical framework for evaluating unified vision–language models and facilitates further exploration of pixel-based multimodal learning.
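To make the idea of rendering text into a pixel space more concrete, the following is a minimal sketch of the layout step only: it wraps a text input into lines, estimates the size of the resulting image, and counts how many patches a vision transformer would see. It is not the paper's rendering pipeline. The function name `plan_text_render`, the monospaced-font assumption, and all default values (character cell size, margin, patch size of 14 pixels) are hypothetical choices made for illustration.

```python
import math
import textwrap
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderPlan:
    """Layout of a text input rendered as an image (illustrative only)."""
    lines: tuple       # wrapped lines of text, top to bottom
    width_px: int      # estimated image width in pixels
    height_px: int     # estimated image height in pixels
    num_patches: int   # number of square ViT patches covering the image


def plan_text_render(text, chars_per_line=80, char_w=8, char_h=16,
                     margin=16, patch=14):
    """Estimate image size and patch count for text drawn in a monospaced font.

    All parameters are assumptions for this sketch, not values from the paper:
    each character occupies a char_w x char_h pixel cell, the text is
    surrounded by `margin` pixels on every side, and the encoder splits the
    image into `patch` x `patch` squares (partial patches count as whole).
    """
    lines = []
    # Preserve explicit line breaks, then wrap each paragraph separately.
    for paragraph in text.splitlines() or [""]:
        lines.extend(textwrap.wrap(paragraph, width=chars_per_line) or [""])

    longest = max(len(line) for line in lines)
    width_px = 2 * margin + longest * char_w
    height_px = 2 * margin + len(lines) * char_h
    num_patches = math.ceil(width_px / patch) * math.ceil(height_px / patch)
    return RenderPlan(tuple(lines), width_px, height_px, num_patches)


if __name__ == "__main__":
    plan = plan_text_render("What is 12 * 7?\nAnswer with a single number.")
    print(plan.width_px, plan.height_px, plan.num_patches)
```

Under these assumptions the patch count grows with the rendered area rather than with the number of text tokens, which is one reason the cost of pixel input can differ from that of token input.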

Paper Structure

This paper contains 20 sections, 1 equation, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Overview of the PEAP Framework. PEAP (Perceive Everything as Pixels) unifies text, structural, and visual inputs into a single pixel space, where a Vision Transformer (ViT) encodes the pixels and a language decoder performs reasoning. Both components are enclosed within the dashed box to indicate that they jointly form a vision–language model (VLM). By eliminating modality-specific preprocessing such as OCR and tokenization, PEAP better aligns with human perception and reduces cross-modal engineering overhead.
  • Figure 2: Key Findings on the PixelWorld Benchmark. Evaluation across text-only, structural, and multimodal settings yields four main findings: (1) Modality Trend: consistent gains on layout-heavy and multimodal tasks such as websites, slides, and documents; (2) Task Complexity: performance degradation on reasoning- and code-centric benchmarks; (3) Transferability by Scale: larger VLMs (e.g., GPT-4o, Gemini-Flash) exhibit smaller pixel–token gaps; and (4) Attention and Efficiency: text and image inputs show similar global attention patterns, while the proposed PEAP-Fast reduces computation overhead by up to 80%.
  • Figure 3: Performance on text-only datasets, comparing text input with synthesized image input. Most models demonstrate comparable performance on language understanding datasets such as SuperGLUE, GLUE, and ARC. However, notable performance disparities emerge between text-based input and synthesized image input on mathematical reasoning tasks (e.g., MMLU-Pro, GSM8K) and programming tasks (e.g., MBPP). Phi-3.5-Vision exhibits consistently poor performance across all vision tasks, primarily due to its insufficient instruction-following capabilities.
  • Figure 4: Performance on the structured dataset. We report all subsets of TableBench. In the semi setting, questions are presented as text, while tables are rendered as synthetic images. For tasks involving reasoning (numerical reasoning) and coding (visualization subset), synthetic images yield inferior performance compared to text. However, for tasks emphasizing semantic understanding, such as data analysis and fact checking, synthetic images achieve performance comparable to or even surpassing text. Additionally, the semi setting often performs worse than either text or synthetic images alone, which points to potential limitations and future directions for leveraging vision–language models (VLMs).
  • Figure 5: Performance on the multimodal dataset (MMMU-Pro). We adopt the results reported in the original paper. Stronger models perform better under PEAP.
  • ...and 8 more figures