Transforming Role Classification in Scientific Teams Using LLMs and Advanced Predictive Analytics

Wonduk Seo; Yi Bu

TL;DR

This work tackles the challenge of classifying author roles in scientific teams beyond self-reports and static clustering by employing large language models (LLMs) and predictive analytics. It combines few-shot prompting on GPT-4 and open-source LLMs to produce fine-grained role labels (Leadership, Direct Support, Indirect Support), then trains a scalable dense neural network on ten OpenAlex-derived features to classify roles efficiently at scale. The study reports GPT-4 as the most accurate LLM for role labeling, achieves an F1 of approximately 0.76 with the predictive model, and uses SHAP to reveal that Probability of Leading and related features are pivotal for leadership identification. The approach promises scalable, context-aware analysis of team dynamics and leadership distribution, with implications for monitoring collaboration patterns and informing research management, while recognizing limitations in accessibility and data coverage of OpenAlex.

Abstract

Scientific team dynamics are critical in determining the nature and impact of research outputs. However, existing methods for classifying author roles, which rely on self-reports and clustering, lack comprehensive contextual analysis of contributions. We present an approach to classifying author roles in scientific teams using large language models (LLMs), which offers a more refined analysis than traditional clustering methods. Specifically, we complement and enhance these traditional methods with open-source and proprietary LLMs (GPT-4, Llama3 70B, Llama2 70B, and Mistral 7x8B) for role classification. Using few-shot prompting, we categorize author roles and show that GPT-4 outperforms the other models across multiple categories and surpasses traditional approaches such as XGBoost and BERT. Our methodology also includes a predictive deep learning model built on 10 features. Training this model on a dataset derived from the OpenAlex database, which provides detailed metadata on academic publications (author-publication history, author affiliation, research topics, and citation counts), yields an F1 score of 0.76, demonstrating robust classification of author roles.
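The few-shot step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' prompt: the instruction wording, the example contribution statements, and the helper names `build_few_shot_prompt` and `parse_role` are all assumptions made for illustration. Only the three role labels come from the paper.

```python
from typing import List, Optional, Tuple

# The three role labels named in the paper.
ROLES = ("Leadership", "Direct Support", "Indirect Support")


def build_few_shot_prompt(examples: List[Tuple[str, str]], contribution: str) -> str:
    """Assemble a prompt: instruction, labeled examples, then the unlabeled query."""
    lines = ["Classify the author's role as one of: " + ", ".join(ROLES) + ".", ""]
    for text, label in examples:
        if label not in ROLES:
            raise ValueError(f"unknown role label: {label!r}")
        lines += [f"Contribution: {text}", f"Role: {label}", ""]
    # The query ends with an open "Role:" slot for the model to complete.
    lines += [f"Contribution: {contribution}", "Role:"]
    return "\n".join(lines)


def parse_role(reply: str) -> Optional[str]:
    """Map a free-text model reply to one of the three labels, or None."""
    lowered = reply.lower()
    # Check longer labels first: "indirect support" contains "direct support".
    for role in sorted(ROLES, key=len, reverse=True):
        if role.lower() in lowered:
            return role
    return None


# Hypothetical contribution statements, invented for this sketch.
EXAMPLES = [
    ("Conceived the study and supervised the project.", "Leadership"),
    ("Provided reagents and commented on the manuscript.", "Indirect Support"),
]

prompt = build_few_shot_prompt(EXAMPLES, "Performed the experiments and analyzed data.")
```

The resulting `prompt` string would be sent to whichever LLM is being evaluated, and `parse_role` would normalize the reply before scoring. Matching longer labels first matters because a naive substring check would read "Indirect Support" as "Direct Support".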

Paper Structure (22 sections, 4 equations, 5 figures, 3 tables)

Introduction
Related Works
Dynamics of Scientific Teams
Traditional Methods of Role Classification
Large Language Models (LLMs) and Application in Science of Science
Dataset
LLM-Based Role Classification: Methods and Evaluation
Overview of LLM-Based Role Classification Tasks
Prompt Engineering for Role Classification
LLM Model Configuration and Performance Comparison
Comparison of LLMs with Traditional Models
Scalable Predictive Modeling for Author Role Classification
Dataset Construction and Feature Engineering
Data Splitting and Normalization
Predictive Model Training and Performance
...and 7 more sections

Figures (5)

Figure 1: Illustration of the workflow involving data sampling, preprocessing, contribution assigning, and classification task for the LLM-based role classification.
Figure 2: Illustration of the workflow involving prompt generation, LLM inference, and classification results.
Figure 3: Label distribution comparison across models. The figure illustrates the distribution of assigned roles—Leadership, Direct Support, and Indirect Support—by four models: GPT-4, Llama3 70B, Llama2 70B, and Mistral 7x8B.
Figure 4: Line plot for F1-score by class for each model.
Figure 5: SHAP summary plot showing feature importance and the directional impact of each feature on the model's predictions.
