PDFTriage: Question Answering over Long, Structured Documents

Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, David Seunghyun Yoon, Ryan A. Rossi, Franck Dernoncourt

TL;DR

PDFTriage addresses the challenge of answering questions over long, structured documents by leveraging document metadata and a set of model-callable retrieval functions to fetch content from precise structural units. The method integrates an LLM-based triage that selects relevant frames (pages, sections, figures, tables) and retrieves content before generating answers, outperforming plain-text retrieval baselines. A new dataset of about 900 questions across 82 documents and 10 question types supports evaluation of structure- and content-aware QA, with human evaluators favoring PDFTriage for multi-page, structure-aware tasks. The work demonstrates that structured retrieval reduces token usage while maintaining or improving answer quality and shows robustness across document lengths, suggesting practical benefits for scalable, document-grounded QA systems.

Abstract

Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document and representing it as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA. Our code and datasets will be released soon on GitHub.

Paper Structure (31 sections, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Overview of the PDFTriage technique: PDFTriage leverages a PDF's structured metadata to implement a more precise and accurate document question-answering approach. It starts by generating a structured metadata representation of the document, extracting information surrounding section text, figure captions, headers, and tables. Next, given a query, an LLM-based Triage selects the document frame needed for answering the query and retrieves it directly from the selected page, section, figure, or table. Finally, the LLM processes the selected context and the input query to generate the answer.
  • Figure 2: PDFTriage Document Distribution by Word Count
  • Figure 3: User Preferences between PDFTriage and Alternate Approaches: Overall, PDFTriage-generated answers were favored the most by the users, claiming 50.8% of the top-ranked answers. Furthermore, PDFTriage answers ranked higher on certain multi-page tasks, such as structure questions and table reasoning, while ranking lower on generalized textual tasks, such as classification and text questions. However, across all the question categories, PDFTriage beat both the Page Retrieval and Chunk Retrieval approaches on a head-to-head ranking.
  • Figure 4: PDFTriage Performance compared to Document Page Length (uses "Overall Quality" scores)
  • Figure 5: Annotation Question #1
  • ...and 12 more figures
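
The flow described in the Figure 1 caption (build a structured metadata representation, let a triage step pick a document frame, fetch that frame, then pass it to the model with the query) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Document` class and the `triage`, `retrieve`, and `build_prompt` functions are hypothetical names introduced here, and simple keyword rules stand in for the LLM-based triage so that the example runs without a model.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical structured metadata for one parsed PDF."""
    pages: dict[int, str] = field(default_factory=dict)     # page number -> text
    sections: dict[str, str] = field(default_factory=dict)  # section title -> text
    figures: dict[int, str] = field(default_factory=dict)   # figure number -> caption
    tables: dict[int, str] = field(default_factory=dict)    # table number -> contents


def triage(query: str, doc: Document):
    """Choose a (frame_type, key) pair for the query.

    Assumption: keyword rules replace the LLM call that a real system would make.
    """
    q = query.lower()
    # Explicit structural references such as "page 2" or "Table 3".
    match = re.search(r"\b(page|figure|table)\s+(\d+)\b", q)
    if match:
        return match.group(1), int(match.group(2))
    # Otherwise look for a known section title mentioned in the query.
    for title in doc.sections:
        if title.lower() in q:
            return "section", title
    return "none", None


def retrieve(doc: Document, frame_type: str, key) -> str:
    """Fetch the selected frame; return an empty string if it does not exist."""
    stores = {
        "page": doc.pages,
        "section": doc.sections,
        "figure": doc.figures,
        "table": doc.tables,
    }
    return stores.get(frame_type, {}).get(key, "")


def build_prompt(query: str, doc: Document) -> str:
    """Combine the retrieved frame and the query into the text sent to the LLM."""
    frame_type, key = triage(query, doc)
    context = retrieve(doc, frame_type, key)
    return f"Context ({frame_type} {key}):\n{context}\n\nQuestion: {query}"


# Toy document with made-up contents, used only to exercise the functions.
doc = Document(
    pages={1: "Intro text", 2: "Results text"},
    sections={"Methods": "We parse the PDF."},
    figures={1: "Overview of the technique"},
    tables={2: "model | score"},
)
print(build_prompt("What is on page 2?", doc))
```

Only the selected frame reaches the prompt, which is the property the TL;DR credits for lower token usage. A query with no structural cue falls through to `("none", None)` here, whereas a real system would presumably fall back to content-based retrieval.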