LLM & HPC: Benchmarking DeepSeek's Performance in High-Performance Computing Tasks

Noujoud Nader, Patrick Diehl, Steve Brandt, Hartmut Kaiser

TL;DR

This study benchmarks DeepSeek for HPC-oriented code generation across four languages (C++, Fortran, Python, Julia) over five kernels (CG solver, 1D heat equation, parallel matrix multiplication, DGEMM, STREAM triad) and contrasts results with GPT-4. It provides a structured methodology, evaluates compilation, runtime behavior, and correctness, and analyzes scalability and performance across diverse architectures. The findings show that while DeepSeek can produce functional HPC code, its scalability and execution efficiency lag GPT-4, with notable language-specific challenges in Fortran and Julia, and limited gains for DGEMM and memory-bandwidth benchmarks. The work underscores that LLM-assisted code generation can reduce development effort but is not yet a replacement for optimized HPC programming, and points to future work on distributed computing and accelerator-aware abstractions.

Abstract

Large Language Models (LLMs), such as GPT-4 and DeepSeek, have been applied to a wide range of domains in software engineering. However, their potential in the context of High-Performance Computing (HPC) remains largely unexplored. This paper evaluates how well DeepSeek, a recent LLM, performs in generating a set of HPC benchmark codes: a conjugate gradient solver, the parallel heat equation, parallel matrix multiplication, DGEMM, and the STREAM triad operation. We analyze DeepSeek's code generation capabilities for traditional HPC languages like C++, Fortran, Julia, and Python. The evaluation includes testing for code correctness, performance, and scaling across different configurations and matrix sizes. We also provide a detailed comparison between DeepSeek and another widely used tool, GPT-4. Our results demonstrate that while DeepSeek generates functional code for HPC tasks, it lags behind GPT-4 in terms of the scalability and execution efficiency of the generated code.

Paper Structure

This paper contains 12 sections, 7 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Two-dimensional classification using the estimated schedule effort of the COCOMO model (easy vs. difficult) and the quality of the code based on compilation, execution, and correctness (poor vs. good). The blue values show the results for ChatGPT [diehl2024evaluating], and the black values show the results for DeepSeek. The Python and Julia data points are tagged with their common file extensions, py and jl, respectively.
  • Figure 2: Performance measurements: (a) parallel heat equation solver on x86-AMD, (b) parallel matrix multiplication on x86-Intel, (c) DGEMM on Arm A64FX, and (d) STREAM triad on Arm Grace Hopper.