Do Large Language Models Speak Scientific Workflows?
Orcun Yildiz, Tom Peterka
TL;DR
The paper addresses the problem of applying large language models (LLMs) to scientific workflows, a domain with complex task dependencies and data requirements. It systematically evaluates multiple LLMs, without fine-tuning, across five workflow systems on three tasks: workflow configuration, task code annotation, and task code translation. The study finds that LLMs have limited domain knowledge: performance varies by model and workflow system, and hallucinations are frequent. Prompting strategies such as few-shot prompting and incorporating external knowledge can improve results. The results provide a baseline for future work and suggest directions such as retrieval-augmented generation and iterative error correction to make LLMs more useful for scientific workflows, helping workflow developers and users understand model capabilities.
Abstract
With the advent of large language models (LLMs), there is growing interest in applying LLMs to scientific tasks. In this work, we conduct an experimental study to explore the applicability of LLMs for configuring, annotating, translating, explaining, and generating scientific workflows. We use five workflow-specific experiments and evaluate several open- and closed-source language models using state-of-the-art workflow systems. Our studies reveal that LLMs often struggle with workflow-related tasks due to their lack of knowledge of scientific workflows. We further observe that the performance of LLMs varies across experiments and workflow systems. Our findings can help workflow developers and users understand LLMs' capabilities in scientific workflows and motivate further research on applying LLMs to workflows.
