POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou

TL;DR

The paper introduces a distillation-free, two-stage framework for end-to-end document conversion that avoids reliance on teacher models. The Uniform Format Warm-up Stage produces a large, diverse synthetic dataset with unified output formats for text, tables, and formulas, while the Iterative Self-improvement Stage automatically refines annotations of real-world data through rule-based filtering and retraining. The resulting compact POINTS-Reader model achieves performance competitive with the state of the art on multiple benchmarks without distillation, surpassing several larger models and rivaling expert OCR systems on key tasks. Both data quality and model generalization improve across complex documents. The model is publicly released; its stated limitations concern language and font coverage.

Abstract

High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
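The annotate, filter, retrain cycle described in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `self_improve`, `Sample`, and the `annotate`, `filters`, and `retrain` callables are all hypothetical names introduced here, and the concrete filtering rules are left to the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class Sample:
    """A real document paired with a model-generated annotation (hypothetical type)."""
    document: Any
    annotation: str


def self_improve(
    model: Any,
    documents: Sequence[Any],
    annotate: Callable[[Any, Any], str],
    filters: Sequence[Callable[[Sample], bool]],
    retrain: Callable[[Any, List[Sample]], Any],
    rounds: int,
) -> Tuple[Any, List[int]]:
    """Run `rounds` iterations of annotate -> filter -> retrain.

    All callables are placeholders supplied by the caller; nothing here
    reflects the paper's actual filters or training procedure.
    Returns the final model and the number of verified samples per round.
    """
    verified_counts: List[int] = []
    for _ in range(rounds):
        # 1. Annotate every real document with the current model.
        candidates = [Sample(doc, annotate(model, doc)) for doc in documents]
        # 2. Keep only samples that pass every rule-based filter.
        verified = [s for s in candidates if all(f(s) for f in filters)]
        # 3. Retrain on the verified subset; the result annotates the next round.
        model = retrain(model, verified)
        verified_counts.append(len(verified))
    return model, verified_counts
```

Tracking the per-round count of verified samples gives a rough view of whether annotation quality is improving from one round to the next, which is the effect the abstract attributes to repeating the process.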

Paper Structure

This paper contains 49 sections, 3 equations, 24 figures, 8 tables.

Figures (24)

  • Figure 1: Example annotations generated by Qwen2.5-VL-72B and POINTS-Reader. Distillation may not reach the performance of the teacher model and can inherit its biases, such as (1) failure to recognize tables, (2) missing text, and (3) incorrect table structures.
  • Figure 2: Illustration of the two-stage pipeline for generating a large-scale, high-quality dataset.
  • Figure 3: (a) Scaling curve of data generated during the uniform format warm-up stage (lower is better). (b) Distribution of aspect ratios (width/height) in the original dataset. Samples with aspect ratios beyond the red dotted line are filtered out.
  • Figure 4: Model performance steadily improves during the self-improvement stage.
  • Figure 5: The F1-score steadily improves during the self-improvement stage. The score is computed prior to data filtering.
  • ...and 19 more figures