CHIME: LLM-Assisted Hierarchical Organization of Scientific Studies for Literature Review Support

Chao-Chun Hsu, Erin Bransom, Jenna Sparks, Bailey Kuehl, Chenhao Tan, David Wadden, Lucy Lu Wang, Aakanksha Naik

TL;DR

This paper investigates using large language models to generate hierarchical organizations of scientific studies to aid literature reviews, introducing CHIME (Constructing HIerarchies of bioMedical Evidence) and a human-in-the-loop workflow. An LLM-based pipeline creates preliminary hierarchies from 472 Cochrane review sets, producing 2,174 hierarchies, with expert corrections on 320 hierarchies across 100 topics, enabling quantitative evaluation of category generation versus study-to-category assignment. Results show strong performance in generating and linking categories (high parent-child accuracy and sibling coherence) but weaker study assignment to categories, which a trained corrector model can substantially improve by 12.6 F1 points. The work releases CHIME and associated models to spur development of better assistive tools for literature review, while noting domain specificity, deployment latency, and reliance on curated inputs as current limitations. The corrector results, especially for claim categorization recall improvements, indicate a promising direction for automating parts of the review workflow alongside expert oversight.

Abstract

Literature review requires researchers to synthesize a large amount of information and is increasingly challenging as the scientific literature expands. In this work, we investigate the potential of LLMs for producing hierarchical organizations of scientific studies to assist researchers with literature review. We define hierarchical organizations as tree structures where nodes refer to topical categories and every node is linked to the studies assigned to that category. Our naive LLM-based pipeline for hierarchy generation from a set of studies produces promising yet imperfect hierarchies, motivating us to collect CHIME, an expert-curated dataset for this task focused on biomedicine. Given the challenging and time-consuming nature of building hierarchies from scratch, we use a human-in-the-loop process in which experts correct errors (both links between categories and study assignment) in LLM-generated hierarchies. CHIME contains 2,174 LLM-generated hierarchies covering 472 topics, and expert-corrected hierarchies for a subset of 100 topics. Expert corrections allow us to quantify LLM performance, and we find that while LLMs are quite good at generating and organizing categories, their assignment of studies to categories could be improved. We attempt to train a corrector model with human feedback, which improves study assignment by 12.6 F1 points. We release our dataset and models to encourage research on developing better assistive tools for literature review.
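
The tree structure defined in the abstract can be sketched as a small data structure. The following is an illustrative sketch only, not the CHIME data format or the authors' evaluation code: the class name `CategoryNode`, the helper functions, the category names, and the study identifiers are all hypothetical. The set-based F1 below is one plausible way to score study assignment for a single category; the paper's exact metric may be computed differently.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class CategoryNode:
    """One topical category; hypothetical structure, not the CHIME schema."""
    name: str
    studies: set[str] = field(default_factory=set)        # studies assigned to this category
    children: list["CategoryNode"] = field(default_factory=list)


def iter_nodes(root: CategoryNode) -> Iterator[CategoryNode]:
    """Yield every category in the hierarchy, parents before children."""
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def assignment_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 between predicted and expert-corrected study assignments."""
    if not predicted and not gold:
        return 1.0  # nothing to assign and nothing assigned
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example hierarchy with made-up study identifiers.
root = CategoryNode(
    "Exercise interventions",
    children=[
        CategoryNode("Aerobic exercise", studies={"study_1", "study_2"}),
        CategoryNode("Resistance training", studies={"study_3"}),
    ],
)

category_names = [node.name for node in iter_nodes(root)]
example_f1 = assignment_f1({"study_1", "study_2", "study_3"},
                           {"study_2", "study_3", "study_4"})
```

Keeping the study assignments as sets on each node makes it straightforward to compare an LLM-generated assignment with an expert-corrected one category by category.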

Paper Structure (43 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Given a set of related studies on a topic, we use LLMs to identify top-level categories focusing on different views of the data (such as P1 and P2), generate multiple hierarchical organizations, and assign studies to different categories. However, these categories and study assignments can contain errors. As illustrated in the figure, the categories Walking and Weight training are not coherent with their siblings (S1-S3) in hierarchy 1 since they are more specific, and the categories Metastasis and Recurrence are incorrectly assigned to the parent category in hierarchy 2 since they are not types of cancer.
  • Figure 2: LLM-based pipeline for preliminary hierarchy generation given a set of related studies on a topic.
  • Figure 3: Claim generation prompt for GPT-3.5 Turbo.
  • Figure 4: Hierarchy proposal module prompt for Claude-2.
  • Figure 5: Prompt for task 1 sibling coherence for both GPT-3.5 Turbo and GPT-4 Turbo.
  • ...and 1 more figure