Science Hierarchography: Hierarchical Organization of Science Literature

Muhan Gao, Jash Shah, Weiqi Wang, Kuan-Hao Huang, Daniel Khashabi

TL;DR

Science Hierarchography presents a scalable framework to hierarchically organize scientific literature across multiple levels of abstraction, from broad domains to individual papers. It introduces Scychic, a hybrid method that alternates embedding-based clustering with LLM prompting to build high-quality hierarchies while minimizing costly LLM calls. The evaluation uses a utilization-based framework with an LLM navigator to measure how efficiently users can locate target papers, showing improved interpretability over traditional search methods. The work demonstrates scalability on a 10K-paper corpus and provides code, data, and a live demo for reproducible research.

Abstract

Scientific knowledge is growing rapidly, making it difficult to track progress and high-level conceptual links across broad disciplines. While tools like citation networks and search engines help retrieve related papers, they lack the abstraction needed to represent the density and structure of activity across subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that spans multiple levels of abstraction -- from broad domains to specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve this goal, we develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting, striking a balance between scalability and semantic precision. Compared to LLM-heavy methods like iterative tree construction, our approach achieves superior quality-speed trade-offs. Our hierarchies capture different dimensions of research contributions, reflecting the interdisciplinary and multifaceted nature of modern science. We evaluate its utility by measuring how effectively an LLM-based agent can navigate the hierarchy to locate target papers. Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo are available: https://github.com/JHU-CLSP/science-hierarchography
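
To make the "cluster, then summarize, then cluster again" idea concrete, the following is a minimal sketch of a bottom-up loop that alternates a clustering step with a summarization step. It is not the authors' implementation: `embed`, `cluster`, `summarize`, and `build_hierarchy` are hypothetical names, the bag-of-words embedding and greedy threshold clustering are toy stand-ins for a real embedding model and clustering algorithm, and `summarize` is a placeholder for the LLM prompt that would name each cluster.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(labels, threshold):
    """Greedy clustering: an item joins the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices into `labels`
    for i, label in enumerate(labels):
        vec = embed(label)
        for members in clusters:
            if cosine(vec, embed(labels[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def summarize(labels):
    """Placeholder for the LLM call: keep the words shared by every member."""
    token_sets = [set(label.lower().split()) for label in labels]
    shared = [w for w in labels[0].lower().split() if all(w in s for s in token_sets)]
    return " ".join(shared) if shared else " / ".join(labels)


def build_hierarchy(papers, thresholds):
    """Alternate clustering and summarization, one threshold per level, bottom-up."""
    nodes = [{"label": p, "children": []} for p in papers]
    for threshold in thresholds:
        groups = cluster([n["label"] for n in nodes], threshold)
        nodes = [
            {
                "label": summarize([nodes[i]["label"] for i in group]),
                "children": [nodes[i] for i in group],
            }
            for group in groups
        ]
    return nodes


papers = [
    "graph neural networks for molecules",
    "graph neural networks for traffic",
    "protein folding with transformers",
    "protein structure with transformers",
]
tree = build_hierarchy(papers, thresholds=[0.5, 0.0])
for topic in tree[0]["children"]:
    print(topic["label"], "->", [p["label"] for p in topic["children"]])
```

In this sketch the summarizer runs once per cluster rather than once per paper, which is the general reason a hybrid design of this kind can keep the number of expensive calls low; the actual prompts, embedding model, and clustering settings are those described in the paper and repository, not the ones assumed here.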

Paper Structure

This paper contains 47 sections, 14 figures, 11 tables, 1 algorithm.

Figures (14)

  • Figure 1: An example of Science Hierarchography, illustrating how scholarly work can be organized hierarchically: from broad research domains at the top, through increasingly specific sub-clusters, down to individual papers at the lowest level. Critically, this structure must be inferred automatically and at scale.
  • Figure 2: Prompt used for Evaluation
  • Figure 3: Prompt for extracting Problem/Solution/Result contributions
  • Figure 4: Prompt of Topic and Rationale Generation
  • Figure 5: Distribution of topics extracted from SciPile: (a) Top-50 topics, (b) Every 200 topics. See §\ref{subsec:representing:papers} for more information.
  • ...and 9 more figures