ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data

Yufan Shen, Chuwei Luo, Zhaoqing Zhu, Yang Chen, Qi Zheng, Zhi Yu, Jiajun Bu, Cong Yao

TL;DR

The paper addresses how to evaluate the efficacy of document instruction data used to train LLMs/MLLMs for document VQA. It introduces ProcTag, which tags the execution process of each instruction, expressed as CoT-inspired pseudo-code over a layout-aware document representation called DocLayPrompt, and then samples data according to the diversity and complexity of those process tags. Experiments show that ProcTag-based sampling outperforms baselines that evaluate only the instruction text, and that on the generated document datasets only 30.5% of the instructions are needed to match the efficacy of the complete dataset. The approach offers a data-centric framework for document understanding and motivates future generalization to broader AI data evaluation tasks.

Abstract

Recently, large language models (LLMs) and multimodal large language models (MLLMs) have demonstrated promising results on the document visual question answering (VQA) task, particularly after training on document instruction datasets. An effective evaluation method for document instruction data is crucial in constructing instruction data with high efficacy, which, in turn, facilitates the training of LLMs and MLLMs for document VQA. However, most existing evaluation methods for instruction data are limited to the textual content of the instructions themselves, thereby hindering the effective assessment of document instruction datasets and constraining their construction. In this paper, we propose ProcTag, a data-oriented method that assesses the efficacy of document instruction data. ProcTag innovatively performs tagging on the execution process of instructions rather than the instruction text itself. By leveraging the diversity and complexity of these tags to assess the efficacy of the given dataset, ProcTag enables selective sampling or filtering of document instructions. Furthermore, DocLayPrompt, a novel semi-structured layout-aware document prompting strategy, is proposed for effectively representing documents. Experiments demonstrate that sampling existing open-source and generated document VQA/instruction datasets with ProcTag significantly outperforms current methods for evaluating instruction data. Impressively, with ProcTag-based sampling in the generated document datasets, only 30.5% of the document instructions are required to achieve 100% efficacy compared to the complete dataset. The code is publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/ProcTag.
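
To make the idea of sampling by tag diversity and complexity concrete, the sketch below shows one plausible selection rule. It is not the authors' implementation: the `Sample` type, the `select_by_tags` and `tag_coverage` functions, the tag names, and the rule itself (rank by tag count as a complexity proxy, then keep a sample only if it contributes an unseen tag) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: an instruction id plus its process tags."""
    sample_id: str
    tags: frozenset


def select_by_tags(samples, budget):
    """Greedy complexity-first, coverage-driven selection (illustrative only)."""
    # Complexity proxy (assumption): more tags means a more complex process.
    ranked = sorted(samples, key=lambda s: len(s.tags), reverse=True)
    covered = set()
    chosen = []
    for sample in ranked:
        if len(chosen) >= budget:
            break
        # Diversity: keep a sample only if it contributes an unseen tag.
        if sample.tags - covered:
            chosen.append(sample)
            covered |= sample.tags
    return chosen, covered


def tag_coverage(covered, samples):
    """Fraction of all distinct tags in `samples` that appear in `covered`."""
    all_tags = set().union(*(s.tags for s in samples)) if samples else set()
    return len(covered & all_tags) / len(all_tags) if all_tags else 0.0


# Toy data with made-up tag names.
data = [
    Sample("q1", frozenset({"locate_table", "read_cell"})),
    Sample("q2", frozenset({"locate_table", "read_cell", "sum_values"})),
    Sample("q3", frozenset({"find_key_value"})),
    Sample("q4", frozenset({"read_cell"})),
]
chosen, covered = select_by_tags(data, budget=3)
print([s.sample_id for s in chosen])  # ['q2', 'q3']
print(tag_coverage(covered, data))    # 1.0
```

In this toy case two of the four samples already cover every tag, which mirrors, at a much smaller scale, the observation that a fraction of the data can carry the full set of execution processes.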

Paper Structure

This paper contains 20 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: The same instruction text can lead to entirely different execution processes when applied to different types of documents.
  • Figure 2: Overview of ProcTag. ProcTag performs tagging on the document instruction execution process for assessing the efficacy of document instruction data, involving three steps: (a) document representation: ProcTag utilizes DocLayPrompt for representing document information; (b) instruction execution process generation: prompting GPT to generate the execution process using pseudo-code; and (c) process tagging: processing the generated pseudo-code to obtain instruction tags.
  • Figure 3: Document VQA performance of an LLM (Qwen) and an MLLM (Qwen-VL) after training on datasets sampled with ProcTag, InsTag, and random sampling from human-annotated (DocVQA) and generated document instruction datasets.
  • Figure 4: Experimental analysis results of data efficacy in terms of data amount ratio and tag coverage rate.
  • Figure 5: image features
  • ...and 3 more figures