Comparative Analysis of OpenAI GPT-4o and DeepSeek R1 for Scientific Text Categorization Using Prompt Engineering
Aniruddha Maiti, Samuel Adewumi, Temesgen Alemayehu Tikure, Zichun Wang, Niladri Sengupta, Anastasiia Sukhanova, Ananya Jana
TL;DR
This study compares GPT-4o and DeepSeek R1 on scientific sentence categorization using prompt engineering and a task-specific dataset of ten arXiv papers. It introduces a paragraph-level prompt framework with 17 predefined relationship categories and evaluates performance in terms of categorization coverage, entity extraction, and cross-model agreement. The findings show that DeepSeek R1 achieves higher coverage and some category-level consistency, while GPT-4o exhibits more misclassification and inconsistent category boundaries, differences that likely reflect different categorization heuristics in the two models. The work highlights the importance of prompt design and dataset scale for reliable scientific text categorization and suggests directions for improving open-source tooling and evaluation methodologies. In practice, the results inform researchers about the trade-offs between a commercial multimodal model and an open-source alternative for structured scientific text understanding.
Abstract
This study examines how large language models categorize sentences from scientific papers using prompt engineering. We use two advanced web-based models, GPT-4o (by OpenAI) and DeepSeek R1, to classify sentences into predefined relationship categories. Although DeepSeek R1 has been tested on benchmark datasets in its technical report, its performance in scientific text categorization remains unexplored. To address this gap, we introduce a new evaluation method designed specifically for this task. We also compile a dataset of cleaned scientific papers from diverse domains, which provides a platform for comparing the two models. Using this dataset, we analyze their effectiveness and consistency in categorization.
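The summary above names categorization coverage and cross-model agreement as evaluation criteria without defining them at this point. The following is a minimal sketch of one plausible way to compute such quantities, not the authors' evaluation method. It assumes each model returns one label per sentence, with `None` marking a sentence left uncategorized. The function names, the category names, and the sample labels are all hypothetical and are not taken from the paper's 17 categories.

```python
from typing import Optional, Sequence


def coverage(labels: Sequence[Optional[str]]) -> float:
    """Fraction of sentences that received some category (None = uncategorized)."""
    if not labels:
        return 0.0
    return sum(label is not None for label in labels) / len(labels)


def agreement(labels_a: Sequence[Optional[str]],
              labels_b: Sequence[Optional[str]]) -> float:
    """Among sentences both models categorized, the fraction with matching labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must label the same sentences")
    both = [(a, b) for a, b in zip(labels_a, labels_b)
            if a is not None and b is not None]
    if not both:
        return 0.0
    return sum(a == b for a, b in both) / len(both)


# Hypothetical per-sentence outputs for five sentences (illustrative only).
model_a = ["cause-effect", None, "comparison", "definition", "comparison"]
model_b = ["cause-effect", "comparison", "definition", None, "comparison"]

print(f"coverage A: {coverage(model_a):.2f}")
print(f"coverage B: {coverage(model_b):.2f}")
print(f"agreement:  {agreement(model_a, model_b):.2f}")
```

Restricting agreement to sentences that both models categorized keeps it separate from coverage, so a model that labels fewer sentences is not penalized twice. Other definitions, such as counting uncategorized sentences as disagreements, are equally defensible.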
