Towards Reliable Generation of Executable Workflows by Foundation Models

Sogol Masoumzadeh, Keheliya Gallaba, Dayi Lin, Ahmed E. Hassan

Abstract

Recent advancements in Foundation Models (FMs) have demonstrated significant progress in processing complex natural language to perform intricate tasks. Successfully executing these tasks often requires orchestrating calls to FMs alongside other software components. However, manually decomposing a task into a coherent sequence of smaller, logically aggregated steps, commonly referred to as workflows, demands considerable effort and specialized domain knowledge. While FMs can assist in generating such workflows specified in domain-specific languages (DSLs), achieving accuracy and reliability in this process remains a challenge. We introduce a framework that leverages static analysis feedback to enable FMs to detect and repair defects in the DSL-based workflows they generate. We begin by presenting an initial taxonomy of defect occurrences in FM-generated DSL workflows, categorizing them into 20 distinct types. Furthermore, we observe a high prevalence of defects across FM-generated DSL workflows, with 89.23% of the studied instances containing at least one defect. This high prevalence underscores the magnitude of the problem and the necessity for mitigation strategies. Following this, we demonstrate that nine types of these defects can be effectively identified through static analysis of the workflows. For this purpose, we develop Timon, the first-of-its-kind static analyzer specifically designed for FM-generated DSL workflows. Finally, we show that by incorporating feedback from Timon, we can guide Pumbaa, an FM-based tool, to repair the detected defect incidences. By systematically detecting and repairing defects, our work takes a crucial step towards the reliable and automated generation of executable workflows from natural-language requirements.

Paper Structure

This paper contains 25 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of workflow generation with FMs using NL-2-DSL pipeline
  • Figure 2: An example of a defective workflow generated by an FM. Task t2 and Task t5, which are on different branches, have an unsatisfiable data dependency between them.
  • Figure 3: Overview depicting the empirical study of defects in FM-generated DSL workflows
  • Figure 4: The taxonomy of 20 defect types in FM-generated DSL workflows, organized into three main categories. Numbers in brackets indicate the total count of each defect found in our 65-sample open-coded dataset.
  • Figure 5: The overview of Timon: The static analyzer for DSL workflow defect detection
  • ...and 3 more figures