Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis

Kaikai Zhao, Zhaoxiang Liu, Xuejiao Lei, Jiaojiao Zhao, Zhenhong Long, Zipeng Wang, Ning Wang, Meijuan An, Qingliang Meng, Peijun Yang, Minjie Hua, Chaoyang Ma, Wen Liu, Kai Wang, Shiguo Lian

TL;DR

This work addresses the gap between state-of-the-art benchmark performance and real-world deployment for DeepSeek models by evaluating a wide range of variants (including 4-bit quantized models and QwQ-32B) on an enhanced, application-driven benchmark, A-Eval-2.0. It introduces an automatic scoring pipeline with two-phase expert validation and analyzes the results along four axes: model scale, reasoning enhancement, distillation, and quantization. Key contributions include the open-source A-Eval-2.0 benchmark, capability boundaries quantified via a model-tier framework, and a practical model selection handbook to guide cost-effective deployment in real-world tasks. The findings reveal nuanced tradeoffs: larger models generally perform better, reasoning enhancements boost logical reasoning and planning but can impair other tasks, and the effects of distillation and quantization vary by task. Together these results help practitioners decide which model to deploy for a given application.

Abstract

DeepSeek-R1, known for its low training cost and exceptional reasoning capabilities, has achieved state-of-the-art performance on various benchmarks. However, detailed evaluations of the DeepSeek series models from the perspective of real-world applications are lacking, making it challenging for users to select the most suitable DeepSeek models for their specific needs. To address this gap, we present the first comprehensive evaluation of DeepSeek and its related models (including DeepSeek-V3, DeepSeek-R1, the DeepSeek-R1-Distill-Qwen series, the DeepSeek-R1-Distill-Llama series, their corresponding 4-bit quantized models, and the reasoning model QwQ-32B) using our enhanced A-Eval benchmark, A-Eval-2.0. Our systematic analysis reveals several key insights: (1) Given identical model architectures and training data, models with more parameters demonstrate superior performance, in line with the scaling law. However, smaller models may achieve enhanced capabilities when trained with optimized strategies and higher-quality data; (2) Reasoning-enhanced models show significant performance gains in logical reasoning tasks but may underperform in text understanding and generation tasks; (3) As data difficulty increases, distillation or reasoning enhancements yield higher performance gains. Interestingly, reasoning enhancements can even have a negative impact on simpler problems; (4) Quantization impacts different capabilities unevenly, with a significant drop in logical reasoning and minimal impact on text generation. Based on these results and findings, we design a model selection handbook that enables users to select the most cost-effective models with minimal effort.

Paper Structure

This paper contains 15 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Average scores of DeepSeek series models on A-Eval-2.0. (a) The overall average score of each model across all data. (b)-(f) The average scores of the models on each task. "Instruct Model" refers to the original instruction-tuned models without reasoning enhancement; "Reasoning Enhancement Model" refers to reasoning-enhanced models, i.e., models distilled using DeepSeek-R1's reasoning data or native reasoning models; and "Quantization Model" refers to the 4-bit quantized versions of the reasoning enhancement models.
  • Figure 2: Average scores of the DeepSeek series models on the 27 subcategories.
  • Figure 3: Performance of each model group on five major tasks.
  • Figure 4: Line charts of evaluation performance. (a) Line chart on the five major tasks. (b)-(f) Line charts on the five major tasks and their corresponding subtasks.
  • Figure 5: Average performance of 22 models on Easy, Medium, and Hard data.
  • ...and 5 more figures