DocSum: Domain-Adaptive Pre-training for Document Abstractive Summarization

Phan Phuong Mai Chau; Souhail Bakkali; Antoine Doucet

TL;DR

The paper tackles the challenge of abstractive summarization for OCR'ed administrative documents, where noise and domain specificity hinder standard models. It introduces DocSum, a framework that combines domain-adaptive pre-training of BART on OCR-transcribed corpora with DAS-enriched prompting that incorporates LLM-generated gold summaries and question–answer pairs. Ground-truth data are generated via Mistral-7B-Instruct, filtered by confidence scores, and used to fine-tune the model on RVL-CDIP, with evaluation on Document Abstractive Summarization (DAS) and Document Text Classification (DTC). Results show that domain adaptation and QA-enhanced inputs yield improvements in summary quality (e.g., higher BERTScore) and classification accuracy, illustrating practical benefits for business and public-sector document workflows. The work highlights ongoing challenges in OCR noise and LLM hallucinations, and proposes directions for more diverse data, robust prompting, and improved reliability in real-world applications.
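The data-preparation steps summarized above (filtering LLM-generated samples by confidence, then enriching the input with question–answer pairs) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `GeneratedSample` fields, the `0.8` threshold, the `question:`/`answer:` template, and the `</s>` separator are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GeneratedSample:
    """One LLM-annotated document (hypothetical structure, not from the paper)."""
    ocr_text: str                      # OCR transcription of the document
    summary: str                       # LLM-generated reference summary
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)
    confidence: float = 0.0            # score attached to the generated annotation


def filter_by_confidence(samples: List[GeneratedSample],
                         threshold: float = 0.8) -> List[GeneratedSample]:
    """Keep only samples whose confidence meets the (assumed) threshold."""
    return [s for s in samples if s.confidence >= threshold]


def build_enriched_input(sample: GeneratedSample, sep: str = " </s> ") -> str:
    """Append the QA pairs to the OCR text to form the model input.

    The template and separator are illustrative choices only.
    """
    qa_text = " ".join(
        f"question: {q} answer: {a}" for q, a in sample.qa_pairs
    )
    parts = [sample.ocr_text]
    if qa_text:
        parts.append(qa_text)
    return sep.join(parts)
```

Under these assumptions, each retained sample would yield one `(build_enriched_input(sample), sample.summary)` pair for fine-tuning a sequence-to-sequence model such as BART.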

Abstract

Abstractive summarization has made significant strides in condensing and rephrasing large volumes of text into coherent summaries. However, summarizing administrative documents presents unique challenges due to domain-specific terminology, OCR-generated errors, and the scarcity of annotated datasets for model fine-tuning. Existing models often struggle to adapt to the intricate structure and specialized content of such documents. To address these limitations, we introduce DocSum, a domain-adaptive abstractive summarization framework tailored for administrative documents. Leveraging pre-training on OCR-transcribed text and fine-tuning with an innovative integration of question-answer pairs, DocSum enhances summary accuracy and relevance. This approach tackles the complexities inherent in administrative content, ensuring outputs that align with real-world business needs. To evaluate its capabilities, we define a novel downstream task setting, Document Abstractive Summarization, which reflects the practical requirements of business and organizational settings. Comprehensive experiments demonstrate DocSum's effectiveness in producing high-quality summaries, showcasing its potential to improve decision-making and operational workflows across the public and private sectors.
