Comparative Analysis of OpenAI GPT-4o and DeepSeek R1 for Scientific Text Categorization Using Prompt Engineering

Aniruddha Maiti, Samuel Adewumi, Temesgen Alemayehu Tikure, Zichun Wang, Niladri Sengupta, Anastasiia Sukhanova, Ananya Jana

TL;DR

This study compares GPT-4o and DeepSeek R1 on scientific sentence categorization using prompt engineering and a task-specific dataset of ten arXiv papers. It introduces a paragraph-level prompt framework with 17 predefined relationship categories and evaluates performance in terms of categorization coverage, entity extraction, and cross-model agreement. The findings show DeepSeek R1 achieves higher coverage and some category-level consistency, while GPT-4o exhibits more misclassification and inconsistent boundaries, likely reflecting different heuristics. The work highlights the importance of prompt design and dataset scale for reliable scientific text categorization and suggests directions for improving open-source tooling and evaluation methodologies. The practical impact lies in informing researchers about the trade-offs between a commercial multimodal model and an open-source alternative for structured scientific text understanding.

Abstract

This study examines how large language models categorize sentences from scientific papers using prompt engineering. We use two advanced web-based models, GPT-4o (by OpenAI) and DeepSeek R1, to classify sentences into predefined relationship categories. Although DeepSeek R1 has been tested on benchmark datasets in its technical report, its performance in scientific text categorization remains unexplored. To address this gap, we introduce a new evaluation method designed specifically for this task and compile a dataset of cleaned scientific papers from diverse domains. Using this dataset as a common platform, we compare the two models' effectiveness and consistency in categorization.

Paper Structure

This paper contains 23 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Agreement rates for different relationship categories assigned by GPT-4o and DeepSeek R1.
  • Figure 2: Pairwise agreement heatmap between GPT-4o and DeepSeek R1 category assignments. The heatmap shows how frequently both models assigned the same category to the same sentence.
  • Figure 3: Entity agreement rates for different relationship categories between GPT-4o and DeepSeek R1.
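The agreement measures named in the figure captions can be illustrated with a minimal sketch. It assumes each model's output has already been reduced to one category label per sentence, in the same sentence order. The function name `agreement_stats`, the example labels, and the choice to compute per-category rates relative to the first model's assignments are all assumptions made for illustration; the paper's exact definitions are not given in this summary.

```python
from collections import Counter


def agreement_stats(labels_a, labels_b):
    """Compare two models' category labels for the same sentences.

    Returns (overall_rate, per_category_rate, pair_counts).
    per_category_rate is conditioned on model A's label (an assumption).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both models must label the same sentences")

    # Count every (label from A, label from B) pair; these counts are
    # the cells of a pairwise agreement heatmap.
    pair_counts = Counter(zip(labels_a, labels_b))

    # Fraction of sentences on which both models chose the same category.
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    overall_rate = matches / len(labels_a) if labels_a else 0.0

    # For each category A used, the share of those sentences that B
    # placed in the same category.
    totals_a = Counter(labels_a)
    per_category_rate = {
        cat: pair_counts[(cat, cat)] / total for cat, total in totals_a.items()
    }
    return overall_rate, per_category_rate, pair_counts


if __name__ == "__main__":
    # Hypothetical labels, not taken from the paper's dataset.
    model_a = ["cause", "cause", "compare", "define"]
    model_b = ["cause", "compare", "compare", "define"]
    overall, per_cat, pairs = agreement_stats(model_a, model_b)
    print(overall)
    print(per_cat)
    print(pairs[("cause", "compare")])
```

Conditioning on one model's labels makes the per-category rate asymmetric; swapping the arguments gives the rate relative to the other model, and a symmetric variant would need a different denominator.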