Using language models to label clusters of scientific documents

Dakota Murray, Chaoqun Ni, Weiye Gu, Trevor Hubbard

TL;DR

The descriptive label generation task is addressed, an empirical basis for the use of language models is established, and a framework to guide future design and evaluation efforts is provided.

Abstract

Automated label generation for clusters of scientific documents is a common task in bibliometric workflows. Traditionally, labels were formed by concatenating distinguishing characteristics of a cluster's documents; while straightforward, this approach often produces labels that are terse and difficult to interpret. The advent and widespread accessibility of generative language models, such as ChatGPT, make it possible to automatically generate descriptive and human-readable labels that closely resemble those assigned by human annotators. Language-model label generation has already seen widespread use in bibliographic databases and analytical workflows. However, its rapid adoption has outpaced its theoretical, practical, and empirical foundations. In this study, we address the automated label generation task and make four key contributions: (1) we define two distinct types of labels, characteristic and descriptive, and contrast descriptive labeling with related tasks; (2) we provide a formalization of descriptive labeling that clarifies important steps and design considerations; (3) we propose a structured workflow for label generation and outline practical considerations for its use in bibliometric workflows; and (4) we develop an evaluative framework to assess descriptive labels generated by language models, demonstrate that they perform on par with or close to characteristic labels, and highlight design considerations for their use. Together, these contributions clarify the descriptive label generation task, establish an empirical basis for the use of language models, and provide a framework to guide future design and evaluation efforts.

Paper Structure

This paper contains 22 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Example of a "map of science" comparing cluster labels. Shown is a two-dimensional projection of an embedding, demonstrating the practical difference between characteristic labels (left) and descriptive labels (right). Clusters are identified for the "Botany" cluster described in the Methods. We first create embeddings based on publication text (title and abstract), which are passed to Lingo4G (a software product for generating maps of documents); the embeddings are then reduced to two dimensions for visualization using UMAP (McInnes et al., 2018).
  • Figure 2: Diagram of our general approach to descriptive label generation. Illustrates the operation $\text{GenerateLabel}(F_i, \text{model}, \text{template}, \gamma) \rightarrow l_i$. Assume that there exists a set of papers, $P$, that have been mapped to clusters $C = [c_1, c_2, c_3, \ldots]$, and that prominent characteristics have already been identified for these clusters. (A) Begin with $F$, which lists the top characteristics for each cluster. (B)-(C) Define a prompt template and the process by which characteristics are encoded in the template through what we term "clauses". (D) A prompt is created for each cluster. (E) For each prompt, a language model is queried and the output labels are collected. The query includes additional model-specific parameters, $\gamma$. (F) The core of the iterative labeling approach. Each label is assessed against a set of checks, representing the operation $\text{Validate}(L) \rightarrow L'$. Here, specific validation criteria include whether the label is locally valid (e.g., whether instructions were followed), whether it is duplicated (appearing multiple times), and whether the label is appropriately specific. Labels that fail this validation are represented as $L'$ and are re-generated until all labels pass validation. (G) The result of this procedure is the final labeled set of clusters, $L$.
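
The generate-and-validate loop described in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_prompt`, `generate_label`, `validate`, and `label_clusters`, the template wording, the `max_words` and `max_rounds` limits, and the `toy_model` stand-in for a language model are all hypothetical. The specificity check mentioned in the caption is omitted because the caption does not say how it is computed, and a round limit is added so the sketch always terminates.

```python
from collections import Counter
from typing import Callable, Dict, List, Set

Features = Dict[str, List[str]]  # e.g. {"keywords": ["plant genomics", ...]}


def build_prompt(template: str, features: Features) -> str:
    """Encode each characteristic as one 'clause' line inside the template."""
    clauses = [f"{name}: {', '.join(values)}" for name, values in features.items()]
    return template.format(clauses="\n".join(clauses))


def generate_label(features: Features, model: Callable[..., str],
                   template: str, gamma: dict) -> str:
    """GenerateLabel(F_i, model, template, gamma) -> l_i."""
    return model(build_prompt(template, features), **gamma).strip()


def validate(labels: Dict[str, str], max_words: int = 8) -> Set[str]:
    """Validate(L) -> L': return the ids of clusters whose labels fail a check."""
    counts = Counter(label.lower() for label in labels.values())
    failed = set()
    for cluster_id, label in labels.items():
        # Assumed local-validity rule: non-empty and at most max_words words.
        locally_valid = 0 < len(label.split()) <= max_words
        duplicated = counts[label.lower()] > 1
        if not locally_valid or duplicated:
            failed.add(cluster_id)
    return failed


def label_clusters(F: Dict[str, Features], model: Callable[..., str],
                   template: str, gamma: dict, max_rounds: int = 5) -> Dict[str, str]:
    """Label every cluster, then re-generate failing labels up to max_rounds times."""
    labels = {cid: generate_label(f, model, template, gamma) for cid, f in F.items()}
    for _ in range(max_rounds):
        failed = validate(labels)
        if not failed:
            break
        for cid in failed:
            labels[cid] = generate_label(F[cid], model, template, gamma)
    return labels


def toy_model(prompt: str, **gamma) -> str:
    """Hypothetical stand-in for a language model: echoes the first listed term."""
    last_clause = prompt.splitlines()[-1]
    return last_clause.split(": ", 1)[1].split(", ")[0].title()


TEMPLATE = "Suggest a short label for a cluster with these characteristics:\n{clauses}"
F = {
    "c1": {"keywords": ["plant genomics", "arabidopsis"]},
    "c2": {"keywords": ["soil microbiome", "rhizosphere"]},
}
labels = label_clusters(F, toy_model, TEMPLATE, {"temperature": 0.0})
```

In practice `toy_model` would be replaced by a call to an actual language model, with $\gamma$ carrying its model-specific parameters. Because a deterministic model returns the same label on every retry, re-generation is only useful when the model samples its output or when the prompt is revised between rounds.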