DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering

Alex Nguyen, Zilong Wang, Jingbo Shang, Dheeraj Mekala

TL;DR

DocMaster introduces an on-device, unified platform for annotation, training, and inference in document question-answering on PDFs, addressing layout-structure and privacy challenges. It provides three integrated interfaces and a backend-frontend dataflow that preserves word-level layout by combining PDF.js rendering with PyMuPDF bounding boxes; models are trained with HuggingFace Transformers, and storage is handled by a local SQL database. The platform supports both layout-aware and text-only models and demonstrates private, end-to-end QA workflows that run entirely on premises. In a UCSD ISEO deployment, DocMaster delivers substantial throughput gains, illustrating the practical impact of this approach. The work positions DocMaster as an open-source framework that lets organizations deploy bespoke, privacy-preserving document QA pipelines.

Abstract

The application of natural language processing models to PDF documents is pivotal for various business applications, yet training models for this purpose remains difficult for businesses because of two specific hurdles: the complexity of working with PDF formats, which requires parsing both text and layout information to curate training data, and the lack of privacy-preserving annotation tools. This paper introduces DOCMASTER, a unified platform for annotating PDF documents, model training, and inference, tailored to document question-answering. The annotation interface enables users to input questions and highlight text spans within the PDF file as answers, saving layout information and text spans accordingly. Furthermore, DOCMASTER supports both state-of-the-art layout-aware and text models for training. Importantly, because annotation, training, and inference occur on-device, it also safeguards privacy. The platform has been instrumental in driving several research prototypes concerning document analysis, such as the AI assistant used by the University of California San Diego's (UCSD) International Services and Engagement Office (ISEO) to process a substantial volume of PDF documents.
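As an illustration of what "saving layout information and text spans" could look like with a local SQL store, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and helper functions (`open_store`, `save_annotation`, `load_annotations`) are hypothetical and are not taken from DocMaster; the sketch only shows one plausible way to keep a question, its answer span, and per-word bounding boxes together on the local machine.

```python
import json
import sqlite3
from typing import Dict, List


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a local annotation store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS annotations (
               id INTEGER PRIMARY KEY,
               document TEXT NOT NULL,
               question TEXT NOT NULL,
               answer_text TEXT NOT NULL,
               word_start INTEGER NOT NULL,  -- index of first answer word
               word_end INTEGER NOT NULL,    -- index of last answer word
               boxes_json TEXT NOT NULL      -- JSON list of [x0, y0, x1, y1]
           )"""
    )
    return conn


def save_annotation(conn: sqlite3.Connection, document: str, question: str,
                    answer_text: str, word_start: int, word_end: int,
                    boxes: List[List[float]]) -> int:
    """Insert one question/answer annotation and return its row id."""
    cur = conn.execute(
        "INSERT INTO annotations "
        "(document, question, answer_text, word_start, word_end, boxes_json) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (document, question, answer_text, word_start, word_end, json.dumps(boxes)),
    )
    conn.commit()
    return cur.lastrowid


def load_annotations(conn: sqlite3.Connection, document: str) -> List[Dict]:
    """Return all annotations for one document, in insertion order."""
    rows = conn.execute(
        "SELECT question, answer_text, word_start, word_end, boxes_json "
        "FROM annotations WHERE document = ? ORDER BY id",
        (document,),
    ).fetchall()
    return [
        {"question": q, "answer_text": a, "word_start": s, "word_end": e,
         "boxes": json.loads(b)}
        for q, a, s, e, b in rows
    ]
```

Storing word indices alongside the answer text means the same record could serve both a text-only model (which needs the span) and a layout-aware model (which also needs the boxes), and nothing leaves the machine.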

Paper Structure (17 sections, 1 equation, 5 figures, 1 table)

Figures (5)

  • Figure 1: DocMaster supports annotation, model training, and inference functionalities for document question-answering in a single platform.
  • Figure 2: Training and inference with layout-aware models require a bounding box for each word. PDF.js cannot reliably provide this data because it returns phrase-level rather than word-level bounding boxes, and some of its bounding boxes are empty. PyMuPDF solves this issue, but the text parsed by PDF.js and PyMuPDF can differ. DocMaster therefore uses PDF.js for frontend rendering and PyMuPDF in the backend, and provides a robust method for mapping a PDF.js selection to the PyMuPDF context.
  • Figure 3: The annotation interface of DocMaster. Users upload a PDF or a zip archive of PDFs, enter their questions, and highlight the answers in each PDF.
  • Figure 4: In the training interface, users can select one of the base models and train it on the previously annotated documents. Each row in the table represents an annotation session and shows the number of documents annotated during that session.
  • Figure 5: Highlighted answers for questions asked by ISEO office staff on a supporting document. The questions are: "What is the job title?" (red), "What are the work hours per week?" (orange), "What is the salary or hourly rate?" (blue), "Where is the internship address?" (green). Private information is redacted.
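
The mapping problem described in the Figure 2 caption can be sketched as follows. This is not DocMaster's actual algorithm, which is not given here; it is a simple whitespace-insensitive match, offered under stated assumptions. Each word tuple follows the layout returned by PyMuPDF's `page.get_text("words")`, namely `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The function names `map_selection_to_words` and `union_box` are hypothetical.

```python
from typing import List, Optional, Tuple

# Same tuple layout as PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def _squash(text: str) -> str:
    """Remove all whitespace so differing spacing between parsers is ignored."""
    return "".join(text.split())


def map_selection_to_words(selection: str,
                           words: List[Word]) -> Optional[Tuple[int, int]]:
    """Return (first, last) word indices covered by the selected text.

    Assumption: the first whitespace-insensitive match is the intended one.
    Returns None when the selection is empty or cannot be found.
    """
    target = _squash(selection)
    if not target:
        return None
    chars: List[str] = []
    owner: List[int] = []  # owner[i] = index of the word that produced chars[i]
    for idx, word in enumerate(words):
        for ch in _squash(word[4]):
            chars.append(ch)
            owner.append(idx)
    start = "".join(chars).find(target)
    if start == -1:
        return None
    end = start + len(target) - 1
    return owner[start], owner[end]


def union_box(words: List[Word], first: int,
              last: int) -> Tuple[float, float, float, float]:
    """Smallest rectangle enclosing words[first..last] (inclusive)."""
    span = words[first:last + 1]
    return (min(w[0] for w in span), min(w[1] for w in span),
            max(w[2] for w in span), max(w[3] for w in span))


# Toy page: in practice the list would come from page.get_text("words").
page_words: List[Word] = [
    (10.0, 10.0, 30.0, 20.0, "Job", 0, 0, 0),
    (34.0, 10.0, 60.0, 20.0, "Title:", 0, 0, 1),
    (64.0, 10.0, 110.0, 20.0, "Research", 0, 0, 2),
    (114.0, 10.0, 150.0, 20.0, "Intern", 0, 0, 3),
]
# A phrase-level selection with irregular spacing, as a frontend might send it.
answer_span = map_selection_to_words("Research   Intern", page_words)
```

Because the match takes the first occurrence, a phrase that appears more than once on a page would need extra disambiguation, for example by also comparing the selection's on-screen position with the candidate boxes.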