LLM & HPC: Benchmarking DeepSeek's Performance in High-Performance Computing Tasks
Noujoud Nader, Patrick Diehl, Steve Brandt, Hartmut Kaiser
TL;DR
This study benchmarks DeepSeek for HPC-oriented code generation in four languages (C++, Fortran, Python, Julia) across five kernels (CG solver, 1D heat equation, parallel matrix multiplication, DGEMM, STREAM triad) and contrasts the results with GPT-4. It provides a structured methodology, evaluates compilation, runtime behavior, and correctness, and analyzes scalability and performance across diverse architectures. The findings show that while DeepSeek can produce functional HPC code, the scalability and execution efficiency of that code lag behind GPT-4's, with notable language-specific challenges in Fortran and Julia, and limited gains for DGEMM and memory-bandwidth benchmarks. The work underscores that LLM-assisted code generation can reduce development effort but is not yet a replacement for optimized HPC programming, and points to future work on distributed computing and accelerator-aware abstractions.
Abstract
Large Language Models (LLMs), such as GPT-4 and DeepSeek, have been applied to a wide range of domains in software engineering. However, their potential in the context of High-Performance Computing (HPC) remains largely unexplored. This paper evaluates how well DeepSeek, a recent LLM, performs in generating a set of HPC benchmark codes: a conjugate gradient solver, the parallel heat equation, parallel matrix multiplication, DGEMM, and the STREAM triad operation. We analyze DeepSeek's code generation capabilities for traditional HPC languages such as C++, Fortran, Julia, and Python. The evaluation includes testing for code correctness, performance, and scaling across different configurations and matrix sizes. We also provide a detailed comparison between DeepSeek and another widely used tool, GPT-4. Our results demonstrate that while DeepSeek generates functional code for HPC tasks, it lags behind GPT-4 in terms of the scalability and execution efficiency of the generated code.
