PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A

Anna Martin-Boyle, Cara A. C. Leckey, Martha C. Brown, Harmanpreet Kaur

TL;DR

This work introduces PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts.

Abstract

Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms such as source citations are not granular enough for the rigorous verification that the scholarly domain requires. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study with 26 researchers who performed two scholarly editing tasks using PaperTrail and a baseline interface. Our results show that PaperTrail significantly lowered participants' trust compared to the baseline. However, this increased caution did not translate into behavioral changes, as people continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task. We discuss the value of claim-evidence matching for understanding LLM trustworthiness in scholarly settings, and present design implications for cognition-friendly communication of provenance information.

Paper Structure

This paper contains 58 sections, 1 equation, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The Argument Extraction Engine (top) provides three extraction methods with different computational cost and accuracy tradeoffs. Colors follow a computational cost gradient: red indicates LLM-based extraction (highest computational cost), gold represents RAG-based extraction (medium cost), and teal denotes similarity-based extraction (lowest cost). We deploy these methods strategically across pipeline stages based on design-time considerations: (1) offline paper-level information extraction that preprocesses research papers into structured claims and evidence; (2) real-time answer-level extraction that decomposes LLM-generated answers into claims and supporting evidence; and (3) real-time claim-evidence matching that uses retrieval-augmented generation (RAG) to filter and align relevant paper claims with answer claims, producing source provenance indicators.
  • Figure 2: The user interface comprises three main panels. The general layout, shown at the top, consists of the Left Panel (A), the Middle Panel (B), and the Right Panel (C). The Left Panel (A) contains the user's main workspace, including the Task Context (A1), a References List (A2), and the Text Editor (A3). The Middle Panel (B) serves as the Chat Interface (B1), which includes a Question Bank (B2) and Chat Controls (B3). The Right Panel (C) displays information provenance, with its content changing based on the condition. The lower half shows the differences between the conditions. In the PaperTrail interface (left), the Middle Panel (B) shows interactive Answer Claims (B4). When a user clicks a claim, the corresponding Paper Claim (C2) is highlighted in the Right Panel (C), which also shows the overall Claim Coverage (C1) for the LLM's answer. In the baseline interface (right), the Middle Panel (B) contains a sentence-level Source Highlight (B5). Clicking this highlight surfaces the verbatim Paper Source (C3) text in the Right Panel (C).
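
The claim-evidence matching step in Figure 1 and the Claim Coverage indicator (C1) in Figure 2 can be illustrated with a small sketch. This is not the authors' implementation. It stands in for the similarity-based (lowest-cost) path only, and it assumes a token-overlap (Jaccard) similarity, a hypothetical threshold of 0.5, and a definition of coverage as the fraction of paper claims matched by at least one answer claim. None of these details are given in the captions, and the names `match_claims` and `Provenance` are invented for illustration.

```python
import re
from dataclasses import dataclass


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a claim."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (an assumed stand-in for the real measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class Provenance:
    supported: dict[str, str]  # answer claim -> best-matching paper claim
    unsupported: list[str]     # answer claims with no paper claim above threshold
    omitted: list[str]         # paper claims that no answer claim matched
    coverage: float            # assumed definition: matched paper claims / all paper claims


def match_claims(answer_claims: list[str],
                 paper_claims: list[str],
                 threshold: float = 0.5) -> Provenance:
    """Align answer claims with paper claims; the threshold is hypothetical."""
    supported: dict[str, str] = {}
    unsupported: list[str] = []
    matched: set[str] = set()
    for claim in answer_claims:
        # Pick the most similar paper claim, if any exist.
        best = max(paper_claims, key=lambda p: jaccard(claim, p), default=None)
        if best is not None and jaccard(claim, best) >= threshold:
            supported[claim] = best
            matched.add(best)
        else:
            unsupported.append(claim)
    omitted = [p for p in paper_claims if p not in matched]
    coverage = len(matched) / len(paper_claims) if paper_claims else 0.0
    return Provenance(supported, unsupported, omitted, coverage)
```

In this sketch the three outputs correspond to the categories named in the abstract: supported assertions, unsupported claims, and information omitted from the answer. A production system would likely swap the Jaccard function for embedding similarity or the RAG-based alignment that Figure 1 describes.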