A Token-level Text Image Foundation Model for Document Understanding

Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang

TL;DR

This work addresses fine-grained text understanding in images with dense, small text by introducing a token-level visual foundation model (TokenFD) trained on the TokenIT dataset (20M images, 1.8B token-mask pairs). It then builds TokenVL, a document-focused multimodal LLM that combines TokenFD with LLM-guided token alignment and supervised instruction tuning for VQA-based document understanding, and reports strong results on OCRBench and document VQA benchmarks. Ablation studies indicate that explicit token-level alignment and a learnable token abstractor are critical for performance. Overall, TokenIT, TokenFD, and TokenVL establish a scalable, token-granularity pathway to unify visual and textual representations for robust document perception, understanding, and reasoning.

Abstract

In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at https://github.com/Token-family/TokenFD.

Paper Structure

This paper contains 21 sections, 6 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: For different tasks, previous works select different VFMs from general foundation models (path 1). In contrast, we develop a unified token-level foundation model, TokenFD, specifically tailored for text-image-related tasks (path 2). TokenFD is trained on a substantial self-built dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. This well-learned model is capable of supplanting other VFMs in related downstream tasks.
  • Figure 2: An overview of the self-constructed token-level TokenIT dataset, comprising 20 million images and 1.8 billion token-mask pairs. (a) provides a detailed description of each sample, including the raw image, a mask, and a JSON file that records BPE token information. The figure also shows (b) the data distribution, (c) the number of selected BPE tokens, and (d) a word cloud highlighting the top 100 BPE tokens.
  • Figure 3: An overview of the proposed TokenFD, where the token-level image features and token-level language features are aligned within the same semantic space. This "image-as-text" alignment seamlessly facilitates user-interactive applications, including text segmentation, retrieval, and visual question answering.
  • Figure 4: The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception by integrating localization prompts to predict coordinates. However, this implicit approach makes it difficult for these models to understand text positions precisely. In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with the corresponding pixels in the input image, enhancing the MLLM's localization awareness.
  • Figure 5: More visualization examples of the natural scene images, document images, and code images.
  • ...and 2 more figures
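
The "image-as-text" alignment described in the Figure 3 caption can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes that per-pixel image features and a BPE token embedding already live in a shared space, and it marks a pixel as belonging to the token when their cosine similarity exceeds a threshold. The function names (`cosine`, `token_mask`), the 2-dimensional features, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def token_mask(pixel_features, token_embedding, threshold=0.5):
    """Return a binary mask: 1 where a pixel feature is similar to the token embedding.

    pixel_features is a grid (list of rows) of feature vectors; the threshold
    is an illustrative assumption, not a value taken from the paper.
    """
    return [
        [1 if cosine(feature, token_embedding) >= threshold else 0 for feature in row]
        for row in pixel_features
    ]


# Toy 2x3 feature map with 2-dimensional features (made-up numbers).
features = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.1], [0.1, 1.0]],
]

# Hypothetical embeddings for two different BPE tokens.
token_a = [1.0, 0.0]
token_b = [0.0, 1.0]

mask_a = token_mask(features, token_a)
mask_b = token_mask(features, token_b)
print(mask_a)
print(mask_b)
```

Querying the same feature map with different token embeddings yields different masks, which is the intuition behind using one aligned space for text segmentation and retrieval.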