DeepSeek vs. ChatGPT vs. Claude: A Comparative Study for Scientific Computing and Scientific Machine Learning Tasks

Qile Jiang, Zhiwei Gao, George Em Karniadakis

TL;DR

The paper examines how current LLMs, including reasoning-augmented variants, perform on rigorous scientific computing tasks spanning traditional numerical methods and scientific machine learning. By designing problems that force explicit method and parameter choices, the authors show that reasoning-enabled models generally outperform non-reasoning ones, with notable successes from ChatGPT o3-mini-high and Claude 3.7 Sonnet extended thinking in several tasks. The study reveals both strengths (adaptive method selection, robust handling of complex domains) and weaknesses (bugs, hyperparameter sensitivity, inefficiencies in tensor operations) across model families. The findings underscore the potential of reasoning-augmented LLMs for scientific problem-solving, while calling for standardized benchmarks and careful human-in-the-loop validation before deployment in research workflows.

Abstract

Large Language Models (LLMs) have emerged as powerful tools for tackling a wide range of problems, including those in scientific computing, particularly in solving partial differential equations (PDEs). However, different models exhibit distinct strengths and preferences, resulting in varying levels of performance. In this paper, we compare the capabilities of the most advanced LLMs--DeepSeek, ChatGPT, and Claude--along with their reasoning-optimized versions in addressing computational challenges. Specifically, we evaluate their proficiency in solving traditional numerical problems in scientific computing as well as leveraging scientific machine learning techniques for PDE-based problems. We designed all our experiments so that a non-trivial decision is required, e.g. defining the proper space of input functions for neural operator learning. Our findings show that reasoning and hybrid-reasoning models consistently and significantly outperform non-reasoning ones in solving challenging problems, with ChatGPT o3-mini-high generally offering the fastest reasoning speed.

Paper Structure

This paper contains 13 sections, 14 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Solutions to the Robertson ODEs computed using the numerical scheme selected by each LLM. For reference, the ODE system is also solved in scipy using the Radau method. For better visualization, the concentration for the $y$ species is scaled by $10^4$ in plots where the solution converges.
  • Figure 2: The predicted solutions given by different models for the Poisson equation.
  • Figure 3: The true solution for the beam equation.
  • Figure 4: Predicted solutions from different LLMs for the beam equation.
  • Figure 5: Convergence rate of different methods tested for integration.
  • ...and 3 more figures
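
The reference solution mentioned in the Figure 1 caption (Robertson ODEs solved in scipy with the Radau method) can be sketched roughly as follows. This is a minimal illustration, not the authors' script: it assumes the standard textbook form of the Robertson system with initial condition $(1, 0, 0)$, and the time horizon and tolerances are arbitrary choices made here. Only the $10^4$ scaling of the $y$ species comes from the caption.

```python
import numpy as np
from scipy.integrate import solve_ivp


def robertson(t, u):
    """Standard Robertson stiff kinetics (assumed form; not taken from the paper)."""
    x, y, z = u
    return [
        -0.04 * x + 1.0e4 * y * z,
        0.04 * x - 1.0e4 * y * z - 3.0e7 * y**2,
        3.0e7 * y**2,
    ]


# Assumed setup: initial state, time span, and tolerances are illustrative only.
u0 = [1.0, 0.0, 0.0]
t_span = (0.0, 500.0)

# Radau is an implicit Runge-Kutta method suited to stiff systems.
sol = solve_ivp(robertson, t_span, u0, method="Radau", rtol=1e-8, atol=1e-10)

x, y, z = sol.y
y_scaled = 1.0e4 * y   # scaling used for visualization in Figure 1
mass = x + y + z       # the three concentrations should sum to 1

print("solver succeeded:", sol.success)
print("max |x + y + z - 1|:", np.max(np.abs(mass - 1.0)))
```

The system is stiff because its rate constants span many orders of magnitude, so an explicit scheme with a fixed, moderately sized step is typically unstable on it. That is what makes the choice of numerical scheme a non-trivial decision in this task.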