Artificial Intuition: Efficient Classification of Scientific Abstracts

Harsh Sakhrani, Naseela Pervez, Anirudh Ravi Kumar, Fred Morstatter, Alexandra Graddy Reed, Andrea Belz

TL;DR

The paper tackles coarse-grained classification of short scientific abstracts. It introduces artificial intuition, a workflow in which a Large Language Model (LLM) generates context-specific keyword metadata. The workflow combines YAKE keyword extraction, the LLM metadata, and Sentence Transformer embeddings, then applies $k$-means clustering to form a coarse label space of size $\hat{k}$. Two novel metrics, redundancy $\mathcal{R}$ and coverage $\mathcal{S}$, quantify label-space orthogonality and document-space spanning, respectively. The authors show that augmenting abstracts with LLM-derived metadata improves precision and F1 in retrieval-based label assignment. In an evaluation on NASA SBIR abstracts, the results indicate practical applicability for portfolio management and potential generalization to longer texts and other domains.
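The clustering step named in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword phrases and their 2-D vectors are hypothetical stand-ins for Sentence Transformer embeddings, `k=2` plays the role of $\hat{k}$, and the first-$k$-points initialization is a simplification chosen to keep the example deterministic.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means on a list of equal-length float vectors."""
    # Naive deterministic init (assumption for this sketch): first k points.
    # Practical pipelines usually prefer a k-means++ style initialization.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so the centroids are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical keyword phrases with made-up 2-D "embeddings".
toy_embeddings = {
    "solar sail": [0.9, 0.1],
    "ion thruster": [1.0, 0.2],
    "propellant tank": [0.8, 0.0],
    "star tracker": [0.1, 0.9],
    "infrared detector": [0.0, 1.0],
    "optical sensor": [0.2, 0.8],
}

phrases = list(toy_embeddings)
assignment, centroids = kmeans([toy_embeddings[p] for p in phrases], k=2)

# Group phrases by cluster index; each cluster is one candidate coarse label.
clusters = {}
for phrase, c in zip(phrases, assignment):
    clusters.setdefault(c, []).append(phrase)

for c, members in sorted(clusters.items()):
    print(c, members)
```

In this toy setting each cluster gathers thematically similar phrases, which is the sense in which clustering yields a coarse label space. How a cluster is turned into a named label, and how $\mathcal{R}$ and $\mathcal{S}$ are computed from it, is not specified in this summary and is left out of the sketch.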

Abstract

It is desirable to coarsely classify short scientific texts, such as grant or publication abstracts, for strategic insight or research portfolio management. These texts efficiently transmit dense information to experts possessing a rich body of knowledge to aid interpretation. Yet this task is remarkably difficult to automate because of brevity and the absence of context. To address this gap, we have developed a novel approach to generate and appropriately assign coarse domain-specific labels. We show that a Large Language Model (LLM) can provide metadata essential to the task, in a process akin to the augmentation of supplemental knowledge representing human intuition, and propose a workflow. As a pilot study, we use a corpus of award abstracts from the National Aeronautics and Space Administration (NASA). We develop new assessment tools in concert with established performance metrics.

Paper Structure (14 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 2: Variation of F1 score with the number of keywords at the threshold of top 1%.
  • Figure 3: Variation of redundancy $\mathcal{R}$ with the number of clusters $\hat{k}$.
  • Figure 4: Analysis workflow and use of the coverage matrix $\mathcal{W}$. In one application (final step in green), the element with the maximum value is used to generate the Coverage. The second usage (blue final step) is to extract those values exceeding a specific threshold $T$ for the label prediction task.
  • Figure 5: Variation of coverage $\mathcal{S}$ with the number of clusters $\hat{k}$.
  • Figure 6: Variation of F1 scores for assigned labels with weights $w$ exceeding the percentile threshold $T$, as defined in the text.
  • ...and 1 more figure