ReportLogic: Evaluating Logical Quality in Deep Research Reports

Jujia Zhao, Zhaoxin Huan, Zihan Wang, Xiaolu Zhang, Jun Zhou, Suzan Verberne, Zhaochun Ren

TL;DR

The paper addresses the gap in evaluating long-form logical quality for Deep Research reports generated by LLMs, arguing that factual correctness or fluency alone is insufficient for downstream use. It introduces ReportLogic, a reader-centric benchmark with a three-layer taxonomy (Macro-Logic, Expositional-Logic, Structural-Logic) and eight diagnostic dimensions, instantiated via context-aware rubrics. To enable scalable assessment, the authors train LogicJudge, an open-source rubric-guided judge, using distilled supervision and a two-stage alignment (SFT then GRPO) and validate it with human annotations across multiple domains. They perform extensive robustness analyses, including adversarial attacks, revealing that off-the-shelf judges are vulnerable to superficial cues and that reasoning-focused models can mask underlying logical gaps. Together, these contributions offer actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated long-form reports, which is essential for trustworthy AI-assisted deep research.

Abstract

Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report's claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explicit claim-support relations (Structural-Logic). Based on this taxonomy, we construct a human-annotated rubric-guided dataset and train an open-source LogicJudge for scalable evaluation. We further evaluate judge robustness via adversarial attacks, showing that off-the-shelf LLM judges are frequently influenced by superficial cues (e.g., verbosity) and that reasoning modes can mask broken support relations. Overall, our results provide actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.

Paper Structure (73 sections, 4 equations, 12 figures, 3 tables)

This paper contains 73 sections, 4 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Comparison between existing evaluation views and ReportLogic on a Deep Research report.
  • Figure 2: ReportLogic framework. We define logical quality as auditability and decompose it into a three-layer taxonomy with eight dimensions. Given a query and paired reports, a rubric generator instantiates each dimension into context-aware inspection items that guide pairwise human annotation.
  • Figure 3: ReportLogic Leaderboard. Heatmap of win-rates for 16 frontier models across three domains. Darker colors indicate higher win-rates (stronger logical quality), while lighter colors indicate lower win-rates. Columns are ordered and grouped as follows: the first three dimensions correspond to Macro-Logic, the next two to Expositional-Logic, the following three to Structural-Logic, and the final column reports Overall.
  • Figure 4: Ablation Study on Rubric Effectiveness.
  • Figure 5: Attack analysis of judge robustness. (a) Targeted-dimension Attack Success Rate (ASR): fraction of cases where the judge prefers the attacked response, with lower indicating better robustness. (b) Bias-type ASR: fraction of cases where the judge prefers a logically equivalent response with surface manipulations.
  • ...and 7 more figures
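
The Figure 5 caption defines Attack Success Rate (ASR) as the fraction of cases where the judge prefers the attacked response, with lower values indicating better robustness. The snippet below is a minimal sketch of that ratio, not the authors' evaluation code. The function names, the `(group, prefers_attacked)` record layout, and the example labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def attack_success_rate(prefers_attacked: Sequence[bool]) -> float:
    """Fraction of cases in which the judge preferred the attacked response."""
    if not prefers_attacked:
        raise ValueError("at least one judged case is required")
    # True counts as 1 and False as 0, so the sum is the number of successful attacks.
    return sum(prefers_attacked) / len(prefers_attacked)


def asr_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute ASR separately for each group (hypothetical record layout).

    Each record is (group, prefers_attacked), where the group could be a
    targeted dimension or a bias type.
    """
    grouped = defaultdict(list)
    for group, preferred in records:
        grouped[group].append(preferred)
    return {group: attack_success_rate(flags) for group, flags in grouped.items()}


# Illustrative, made-up judgments; these are not results from the paper.
example_records = [
    ("Macro-Logic", True),
    ("Macro-Logic", False),
    ("Structural-Logic", False),
]
example_asr = asr_by_group(example_records)
```

Grouping by the targeted dimension would correspond to panel (a), and grouping by the kind of surface manipulation to panel (b), under the assumption that each judged case carries one such label.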