DocSum: Domain-Adaptive Pre-training for Document Abstractive Summarization

Phan Phuong Mai Chau, Souhail Bakkali, Antoine Doucet

TL;DR

The paper tackles the challenge of abstractive summarization for OCR'ed administrative documents, where noise and domain specificity hinder standard models. It introduces DocSum, a framework that combines domain-adaptive pre-training of BART on OCR-transcribed corpora with DAS-enriched prompting that incorporates LLM-generated gold summaries and question–answer pairs. Ground-truth data are generated via Mistral-7B-Instruct, filtered by confidence scores, and used to fine-tune the model on RVL-CDIP, with evaluation on Document Abstractive Summarization (DAS) and Document Text Classification (DTC). Results show that domain adaptation and QA-enhanced inputs yield improvements in summary quality (e.g., higher BERTScore) and classification accuracy, illustrating practical benefits for business and public-sector document workflows. The work highlights ongoing challenges in OCR noise and LLM hallucinations, and proposes directions for more diverse data, robust prompting, and improved reliability in real-world applications.

Abstract

Abstractive summarization has made significant strides in condensing and rephrasing large volumes of text into coherent summaries. However, summarizing administrative documents presents unique challenges due to domain-specific terminology, OCR-generated errors, and the scarcity of annotated datasets for model fine-tuning. Existing models often struggle to adapt to the intricate structure and specialized content of such documents. To address these limitations, we introduce DocSum, a domain-adaptive abstractive summarization framework tailored for administrative documents. Leveraging pre-training on OCR-transcribed text and fine-tuning with an innovative integration of question-answer pairs, DocSum enhances summary accuracy and relevance. This approach tackles the complexities inherent in administrative content, ensuring outputs that align with real-world business needs. To evaluate its capabilities, we define a novel downstream task setting, Document Abstractive Summarization, which reflects the practical requirements of business and organizational settings. Comprehensive experiments demonstrate DocSum's effectiveness in producing high-quality summaries, showcasing its potential to improve decision-making and operational workflows across the public and private sectors.

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Example output summary. The first box displays text extracted from the left image. The second box presents a question-answer pair that highlights key information from the document. The third box provides a summary of the OCRed text and the associated question-answer pair. Words in red indicate OCR errors.
  • Figure 2: The overall pipeline. During pre-training, OCRed text and LLM-generated question-answer pairs are combined as input to further train the pre-trained BART language model, adapting it to domain-specific knowledge. In the fine-tuning phase, selected documents, along with their question-answer pairs and LLM-generated gold summaries, are used to fine-tune the pre-trained DocSum model. Additionally, LLM prompts include OCRed text, context (such as document category and key information), instructions for data generation, and output indicators specifying the desired response type.
  • Figure 3: Different input formats according to different prompts.
  • Figure 4: Analysis of document characteristics from the IIT-CDIP and RVL-CDIP datasets.
  • Figure 5: Comparison of generated summaries: (a) Example with lower BERTScore due to noisy input, and (b) Example with higher BERTScore due to clean input.
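
The Figure 2 caption describes combining OCRed text with LLM-generated question-answer pairs as model input, and the TL;DR mentions filtering generated data by confidence scores. The sketch below illustrates one way those two steps could look. It is not the authors' implementation: the `GeneratedExample` fields, the `question:/answer:/document:` template, and the `0.8` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneratedExample:
    """One LLM-generated example. All field names are hypothetical."""
    ocr_text: str
    qa_pairs: List[Tuple[str, str]]  # (question, answer) pairs
    summary: str
    confidence: float  # assumed to lie in [0, 1]


def build_input(ocr_text: str, qa_pairs: List[Tuple[str, str]]) -> str:
    """Concatenate question-answer pairs and OCRed text into one input string.

    The template used here is an assumption, not the format from the paper.
    """
    qa_block = " ".join(f"question: {q} answer: {a}" for q, a in qa_pairs)
    return f"{qa_block} document: {ocr_text}".strip()


def filter_by_confidence(
    examples: List[GeneratedExample], threshold: float = 0.8
) -> List[GeneratedExample]:
    """Keep only examples whose confidence meets the (assumed) threshold."""
    return [ex for ex in examples if ex.confidence >= threshold]


if __name__ == "__main__":
    examples = [
        GeneratedExample(
            ocr_text="Invoice total 42 USD",
            qa_pairs=[("What is the total?", "42 USD")],
            summary="An invoice for 42 USD.",
            confidence=0.93,
        ),
        GeneratedExample(
            ocr_text="lnv0ice t0tal ?? USD",  # simulated OCR noise
            qa_pairs=[],
            summary="Unclear invoice.",
            confidence=0.41,
        ),
    ]
    for ex in filter_by_confidence(examples):
        print(build_input(ex.ocr_text, ex.qa_pairs))
```

Placing the question-answer pairs before the document text is a design choice of this sketch; the source does not specify the order, and Figure 3 indicates that several input formats were compared.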