Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis

Kaikai Zhao; Zhaoxiang Liu; Xuejiao Lei; Jiaojiao Zhao; Zhenhong Long; Zipeng Wang; Ning Wang; Meijuan An; Qingliang Meng; Peijun Yang; Minjie Hua; Chaoyang Ma; Wen Liu; Kai Wang; Shiguo Lian

TL;DR

This work addresses the gap between benchmark SOTA performance and real-world deployment for DeepSeek models by evaluating a wide range of variants (including 4-bit quantized models and QwQ-32B) on an enhanced, application-driven benchmark, A-Eval-2.0. It introduces an automatic scoring pipeline with two-phase expert validation and analyzes the results along four axes: model scale, reasoning enhancement, distillation, and quantization. Key contributions include the open-source A-Eval-2.0 benchmark, capability boundaries quantified via a model-tier framework, and a practical model selection handbook to guide cost-effective deployment in real-world tasks. The findings reveal nuanced tradeoffs: larger models generally perform better, reasoning enhancements boost logical reasoning and planning but can impair other tasks, and the effects of distillation and quantization vary by task. Together these inform practitioners on which model to deploy for a given application.

Abstract

DeepSeek-R1, known for its low training cost and exceptional reasoning capabilities, has achieved state-of-the-art performance on various benchmarks. However, detailed evaluations of the DeepSeek series models from the perspective of real-world applications are lacking, making it challenging for users to select the most suitable DeepSeek model for their specific needs. To address this gap, we present the first comprehensive evaluation of DeepSeek and its related models (including DeepSeek-V3, DeepSeek-R1, the DeepSeek-R1-Distill-Qwen series, the DeepSeek-R1-Distill-Llama series, their corresponding 4-bit quantized models, and the reasoning model QwQ-32B) using our enhanced A-Eval benchmark, A-Eval-2.0. Our systematic analysis reveals several key insights: (1) Given identical model architectures and training data, models with more parameters demonstrate superior performance, in line with the scaling law. However, smaller models may achieve enhanced capabilities when trained with optimized strategies and higher-quality data; (2) Reasoning-enhanced models show significant performance gains in logical reasoning tasks but may underperform in text understanding and generation tasks; (3) As data difficulty increases, distillation and reasoning enhancements yield higher performance gains. Interestingly, reasoning enhancements can even have a negative impact on simpler problems; (4) Quantization impacts different capabilities unevenly, with a significant drop in logical reasoning and minimal impact on text generation. Based on these findings, we design a model selection handbook that enables users to select the most cost-effective model with minimal effort.
