Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems

Aasish Kumar Sharma, Julian Kunkel

TL;DR

The paper investigates whether large language models (LLMs) can perform constraint-based HPC workload mapping and makespan calculation from purely natural-language descriptions. Using a textual description of a representative heterogeneous HPC scenario, 21 publicly available LLMs are evaluated against a manually derived analytical optimum of $9\text{ h }20\text{ s}$, which serves as ground truth. The study introduces a one-shot evaluation framework, catalogs reasoning and constraint-violation patterns, and finds that three models reproduce the optimum exactly and twelve come within two minutes of it, while six produce suboptimal schedules with arithmetic or dependency errors. The findings position LLMs as explainable co-pilots for optimization and decision-support tasks rather than autonomous solvers, and highlight the need for hybrid pipelines with formal verification to ensure reliability in complex scheduling tasks.

Abstract

Large language models (LLMs) are increasingly explored for their reasoning capabilities, yet their ability to perform structured, constraint-based optimization from natural language remains insufficiently understood. This study evaluates twenty-one publicly available LLMs on a representative heterogeneous high-performance computing (HPC) workload mapping and scheduling problem. Each model received the same textual description of system nodes, task requirements, and scheduling constraints, and was required to assign tasks to nodes, compute the total makespan, and explain its reasoning. A manually derived analytical optimum of nine hours and twenty seconds served as the ground truth reference. Three models exactly reproduced the analytical optimum while satisfying all constraints, twelve achieved near-optimal results within two minutes of the reference, and six produced suboptimal schedules with arithmetic or dependency errors. All models generated feasible task-to-node mappings, though only about half maintained strict constraint adherence. Nineteen models produced partially executable verification code, and eighteen provided coherent step-by-step reasoning, demonstrating strong interpretability even when logical errors occurred. Overall, the results define the current capability boundary of LLM reasoning in combinatorial optimization: leading models can reconstruct optimal schedules directly from natural language, but most still struggle with precise timing, data transfer arithmetic, and dependency enforcement. These findings highlight the potential of LLMs as explainable co-pilots for optimization and decision-support tasks rather than autonomous solvers.
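The three outcome categories reported above can be illustrated with a minimal Python sketch. It assumes makespans are compared in seconds against the nine-hours-twenty-seconds reference and that "near-optimal" means an absolute deviation of at most two minutes; the function name, the constants, and the handling of a schedule that matches the optimum but violates a constraint are illustrative assumptions, not the authors' evaluation code.

```python
# Hypothetical sketch of the outcome categories described in the abstract.
# All names and the exact decision rules are assumptions for illustration.

OPTIMUM_S = 9 * 3600 + 20          # nine hours and twenty seconds, in seconds
NEAR_OPTIMAL_TOLERANCE_S = 2 * 60  # "within two minutes of the reference"


def classify_makespan(makespan_s: int, constraints_ok: bool) -> str:
    """Label a reported makespan relative to the analytical optimum."""
    if makespan_s == OPTIMUM_S and constraints_ok:
        return "optimal"
    # Assumption: the deviation is measured as an absolute difference.
    if abs(makespan_s - OPTIMUM_S) <= NEAR_OPTIMAL_TOLERANCE_S:
        return "near-optimal"
    return "suboptimal"


# Counts reported in the abstract; they should cover all evaluated models.
reported_counts = {"optimal": 3, "near-optimal": 12, "suboptimal": 6}
total_models = sum(reported_counts.values())

if __name__ == "__main__":
    print(OPTIMUM_S)
    print(classify_makespan(OPTIMUM_S + 90, constraints_ok=True))
    print(total_models)
```

Under these assumptions, a schedule that matches the reference makespan but breaks a constraint is labelled near-optimal rather than optimal, which mirrors the abstract's requirement that the three exact reproductions also satisfy all constraints.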

Paper Structure

This paper contains 17 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Simplified Petri net representing task dependencies and resource tokens [sharma2025workflow].
  • Figure 2: Evaluation methodology: Each LLM receives the same natural-language description of the HPC system and workload as used in the manually derived optimal baseline. Outputs are compared for constraint adherence, correctness of makespan, and quality of reasoning and explanation.
  • Figure 3: DAG for the sample HPC workflow.
  • Figure 4: Number of models (out of 21) that succeeded on each qualitative metric.