ZeroSense: How Vision matters in Long Context Compression

Yonghan Gao, Zehong Chen, Lijian Xu, Jingzhi Chen, Jingwei Guan, Xingyu Zeng

Abstract

Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressively high token compression ratios for long-context modeling tasks by rendering text as images. However, existing evaluation protocols rely heavily on downstream task performance. Such metrics fail to measure text preservation accurately, because the strong linguistic priors of Multimodal Large Language Models (MLLMs) can mask what the compression actually preserves. In this work, we introduce a new evaluation framework that decouples VTC quality from the MLLM's own capabilities, so that VTC quality can be assessed faithfully. Within this framework, we further introduce the ZeroSense Benchmark, whose test samples have low semantic correlation. By eliminating contextual dependencies, the benchmark ensures that evaluation results reflect VTC quality alone, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.

Paper Structure (35 sections, 10 equations, 8 figures, 5 tables, 3 algorithms)

Figures (8)

  • Figure 1: Decoupled analysis of confounding factors in visual-text compression. (a) Semantic Priors Compensation: The model leverages contextual inference to rectify typographical errors in the source image. (b) Raw Perceptual Bottleneck: In a semantic vacuum, the model fails on incoherent alphanumeric strings despite high input resolution, revealing the intrinsic ceiling of its raw recognition capability. (c) Performance Decomposition Framework: End-to-end performance is decomposed into the interaction among three factors: the text preserved by visual-text compression, the model's raw recognition capability, and the model's contextual inference capability.
  • Figure 1: Comparison between original and ZeroSense document images. Top: Full-page views demonstrate that our generation pipeline perfectly preserves the document's structural context. Bottom: Transcribed zoomed-in regions highlight the semantic decoupling; the original, semantically coherent text is systematically replaced with tokens sampled from a low-probability vocabulary subset, isolating the visual layout characteristics.
  • Figure 2: Pipeline for semantically irrelevant text generation and rendering. The system comprises three modules: an Analyzer that extracts key visual attributes (e.g., word counts and bounding-box coordinates) from input images using an OCR tool or ground-truth annotations; a Text Generator that produces semantically irrelevant text with an LLM; and a Renderer that synthesizes the final images by combining the extracted visual attributes with the generated text.
  • Figure 2: Posterior probability distributions of the textual content in Fox, Omni, and their corresponding rendered ZeroSense data. Yellow denotes the original data, and blue denotes the rendered data. The text in the original datasets exhibits strong semantic priors, whereas the textual content in ZeroSense is highly semantically irrelevant.
  • Figure 3: (a) The text token distribution density of the Fox and Omni datasets. The data distribution for Fox is highly concentrated around 900 tokens. Conversely, Omni exhibits a prominent long-tail characteristic; a substantial volume of its data is clustered between 300 and 600 tokens, yet the proportion exceeding 2,100 tokens remains non-negligible. (b) and (c) illustrate typical document examples with extremely low and high token counts, respectively.
  • ...and 3 more figures
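
The three-module pipeline described in the Figure 2 caption (Analyzer, Text Generator, Renderer) can be illustrated with a minimal sketch. This is not the authors' implementation. All names here (`WordBox`, `RARE_VOCAB`, `analyze`, `generate_irrelevant`, `render_plan`, `zerosense_plan`) are hypothetical, and two simplifications are assumed: the Text Generator is replaced by uniform sampling from a small hand-picked word list rather than an LLM, and the Renderer only emits a drawing plan (text plus bounding box) instead of synthesizing an image.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

# Hypothetical stand-in for a low-probability vocabulary subset.
RARE_VOCAB = ["quokka", "zephyr", "obelisk", "fjord", "kumquat",
              "sphinx", "glyph", "yurt", "axolotl", "tundra"]


@dataclass(frozen=True)
class WordBox:
    text: str
    bbox: BBox


def analyze(annotations: Sequence[Tuple[str, BBox]]) -> List[WordBox]:
    """Analyzer stand-in: wrap OCR or ground-truth (word, bbox) pairs."""
    return [WordBox(text, bbox) for text, bbox in annotations]


def generate_irrelevant(words: Sequence[WordBox],
                        rng: random.Random) -> List[WordBox]:
    """Text Generator stand-in: swap each word for a random rare word.

    The word count and every bounding box are kept unchanged, so only
    the semantic content differs from the source page.
    """
    return [WordBox(rng.choice(RARE_VOCAB), w.bbox) for w in words]


def render_plan(words: Sequence[WordBox]) -> List[Dict[str, object]]:
    """Renderer stand-in: list what to draw and where (no image output)."""
    return [{"text": w.text, "bbox": w.bbox} for w in words]


def zerosense_plan(annotations: Sequence[Tuple[str, BBox]],
                   seed: int = 0) -> List[Dict[str, object]]:
    """Run the three stand-in stages in order with a seeded RNG."""
    rng = random.Random(seed)
    return render_plan(generate_irrelevant(analyze(annotations), rng))
```

Calling `zerosense_plan([("The", (0, 0, 30, 12)), ("cat", (34, 0, 60, 12))])` returns two entries with the original bounding boxes and replacement words drawn from `RARE_VOCAB`. Keeping the layout fixed while replacing the words is what lets a recognition model be tested without help from context.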