A Comparison of DeepSeek and Other LLMs

Tianchen Gao, Jiashun Jin, Zheng Tracy Ke, Gabriel Moryoussef

TL;DR

This work benchmarks DeepSeek against Claude, Gemini, GPT, and Llama on two prediction tasks, authorship classification and citation classification, using the datasets MADStatAI (generated with LLMs from the recent MADStat dataset) and CitaStat. It finds that Claude generally offers the highest accuracy, while DeepSeek is somewhat less accurate and slower but cheaper to use, and it documents strong cross-model similarities in outputs. The paper also introduces a data-generation recipe and fully labeled datasets (MADStatAI, CitaStat) to enable benchmark-driven evaluation, and explores temporal stability as well as a hybrid approach that combines Higher Criticism with LLMs to surpass pure LLM performance. Collectively, the work provides practical benchmarks, insights into model complementarity, and directions for future AI-content evaluation and data-generation research.

Abstract

Recently, DeepSeek has been the focus of attention in and beyond the AI community. An interesting problem is how DeepSeek compares to other large language models (LLMs). There are many tasks an LLM can do, and in this paper, we use the task of "predicting an outcome using a short text" for comparison. We consider two settings, an authorship classification setting and a citation classification setting. In the first one, the goal is to determine whether a short text is written by a human or by AI. In the second one, the goal is to classify a citation into one of four types using the textual content. For each experiment, we compare DeepSeek with $4$ popular LLMs: Claude, Gemini, GPT, and Llama. We find that, in terms of classification accuracy, DeepSeek outperforms Gemini, GPT, and Llama in most cases, but underperforms Claude. We also find that DeepSeek is comparatively slower than the others but has a low cost to use, while Claude is much more expensive than all the others. Finally, we find that in terms of similarity, the output of DeepSeek is most similar to those of Gemini and Claude (and among all $5$ LLMs, Claude and Gemini have the most similar outputs). In this paper, we also present a fully labeled dataset collected by ourselves, and propose a recipe where we can use the LLMs and a recent dataset, MADStat, to generate new datasets. The datasets in our paper can be used as benchmarks for future studies on LLMs.

Paper Structure

This paper contains 11 sections, 1 equation, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of the lengths of human-generated and AI-generated abstracts. The x-axis is the length of an original abstract, and the y-axis is the length of its AI counterpart (left panel) or humAI counterpart (right panel).
  • Figure 2: The boxplots of per-author classification errors.
  • Figure 3: The prediction agreement among 5 LLMs in detecting AI from human texts. Left: 'human versus AI' (AC1). Right: 'human versus humAI' (AC2). Take the cell on the first row and second column (left panel) for example: for $64\%$ of the samples, the predicted outcomes by Claude-3.5-sonnet and DeepSeek-R1 are exactly the same.
  • Figure 4: The prompt for 2-class citation classification, where [Reference Key] is the phrase in the text representing this reference, and [Example 1] is an example text from Background (other categories are similar). The prompt for 4-class classification is similar, except that the sentence "Furthermore, we consider ..." is removed and the last sentence is changed to "Please reply only with one of the following: Background, Comparison, Fundamental idea, or Technical basis."
  • Figure 5: The prediction agreement among 6 LLMs in citation classification. Left: 4-class citation classification (CC1). Right: 2-class citation classification (CC2). Take the cell on the first row and second column (left panel) for example: for $73\%$ of the samples, the predicted outcomes by Claude-3.5-sonnet and DeepSeek-V3 are exactly the same.
  • ...and 2 more figures
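
The prediction agreement described in the captions of Figures 3 and 5 (the fraction of samples on which two models return exactly the same label) can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `agreement_matrix`, the model names, and the toy labels are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def agreement_matrix(predictions):
    """Return pairwise agreement rates between models.

    `predictions` maps a model name to its list of predicted labels,
    with every list ordered by the same samples (an assumption of this sketch).
    """
    names = list(predictions)
    n = len(predictions[names[0]])
    if n == 0 or any(len(labels) != n for labels in predictions.values()):
        raise ValueError("all models need the same non-zero number of predictions")

    # A model always agrees with itself.
    matrix = {name: {name: 1.0} for name in names}
    for a, b in combinations(names, 2):
        same = sum(x == y for x, y in zip(predictions[a], predictions[b]))
        matrix[a][b] = matrix[b][a] = same / n
    return matrix


# Hypothetical toy labels, not data from the paper.
toy_predictions = {
    "model_a": ["human", "AI", "AI", "human"],
    "model_b": ["human", "AI", "human", "human"],
    "model_c": ["AI", "AI", "human", "human"],
}

toy_agreement = agreement_matrix(toy_predictions)
print(toy_agreement["model_a"]["model_b"])
```

Each off-diagonal entry corresponds to one cell of an agreement heatmap; the matrix is symmetric because agreement does not depend on the order of the two models.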