
ChainBuddy: An AI Agent System for Generating LLM Pipelines

Jingyue Zhang, Ian Arawjo

TL;DR

ChainBuddy automates end-to-end generation of LLM evaluation pipelines to address the blank-page problem by producing editable starter flows from a single prompt within ChainForge. The approach employs a multi-agent architecture (requirement gathering, planning, task-specific agents, edge/layout) to craft inspectable pipelines and evaluators, forming the AutoLLMOps paradigm. A within-subject usability study finds that AI-assisted workflows reduce workload, boost confidence, and improve expert-rated pipeline quality, though subjective self-assessment and objective performance can diverge, underscoring concerns about over-reliance. The work contributes a practical AI-assisted workflow design, discusses interaction design considerations for requirement elicitation, and outlines future enhancements (data import, editability, diverse templates) to further reduce the blank-page barrier in LLM pipeline creation.

Abstract

As large language models (LLMs) advance, their potential applications have grown significantly. However, it remains difficult to evaluate LLM behavior on user-defined tasks and craft effective pipelines to do so. Many users struggle with where to start, often referred to as the "blank page problem." ChainBuddy, an AI workflow generation assistant built into the ChainForge platform, aims to tackle this issue. From a single prompt or chat, ChainBuddy generates a starter evaluative LLM pipeline in ChainForge aligned to the user's requirements. ChainBuddy offers a straightforward and user-friendly way to plan and evaluate LLM behavior and make the process less daunting and more accessible across a wide range of possible tasks and use cases. We report a within-subjects user study comparing ChainBuddy to the baseline interface. We find that when using AI assistance, participants reported a less demanding workload, felt more confident, and produced higher quality pipelines evaluating LLM behavior. However, we also uncover a mismatch between subjective and objective ratings of performance: participants rated their successfulness similarly across conditions, while independent experts rated participant workflows significantly higher with AI assistance. Drawing connections to the Dunning-Kruger effect, we draw design implications for the future of workflow generation assistants to mitigate the risk of over-reliance.

Paper Structure (51 sections, 11 figures, 3 tables)

Figures (11)

  • Figure 1: ChainBuddy interface and example usage. Users specify requirements (A), and ChainBuddy replies with a requirements-gathering form (B) that users can either fill out and send or follow up on with an open-ended chat (C). The user presses the green button (D) to indicate that they are ready to generate a flow. After a delay of 10-20 seconds, ChainBuddy produces a starter pipeline (E). Here, the starter pipeline includes example inputs, multiple prompts to try (prompt templates), two queried models, and a Python-based code evaluator.
  • Figure 2: ChainBuddy system architecture. A front-end requirement agent elicits user intent and context (left). When the user presses the generate button (Fig. 1D), a specification of user intent is sent to the Planner in the back-end. The Planner agent breaks down the problem into tasks and sends each to a dedicated agent; the outputs are combined through layout-providing connection agents and merged. The final output is passed to the front-end as a complete ChainForge flow (JSON).
  • Figure 3: Participant responses to Likert Questions for six NASA TLX measures and five system usability measures, grouped by Condition. Asterisk (*) indicates a significant main effect of Condition at $p<0.05$; † indicates another main effect, interaction effect or outliers.
  • Figure 4: Performance ratings of participant workflows by expert raters, with AI support (Assistant) and without (Control). Expert raters were blind to the condition and used the same scoring scale from our technical evaluation (see the rating guidelines in the appendix).
  • Figure 5: Interaction traces of participant actions when completing tasks, comparing conditions. Notice the relative dearth of time spent editing input data in the Assistant condition, compared to the Control.
  • ...and 6 more figures
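
The back-end flow described in the Figure 2 caption (Planner, task-specific agents, connection agents, merged JSON output) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the fixed task list, and the `nodes`/`edges` dictionary layout are hypothetical and do not reproduce ChainBuddy's implementation or the actual ChainForge flow schema. In the real system each step would be backed by an LLM call; here each is a deterministic stand-in.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in a hypothetical flow graph (not the real ChainForge schema)."""
    id: str
    kind: str
    data: dict = field(default_factory=dict)


def plan(spec: str) -> list[str]:
    """Stand-in for the Planner: break a requirement spec into tasks.

    A real planner would query an LLM; this one returns a fixed task list.
    """
    return ["inputs", "prompt", "models", "evaluator"]


def run_task_agent(task: str, spec: str) -> Node:
    """Stand-in for a task-specific agent: produce one node for one task."""
    return Node(id=f"{task}-1", kind=task, data={"spec": spec})


def connect(nodes: list[Node]) -> list[dict]:
    """Stand-in for the connection agents: chain the nodes in order."""
    return [{"source": a.id, "target": b.id} for a, b in zip(nodes, nodes[1:])]


def generate_flow(spec: str) -> dict:
    """Planner -> task agents -> connection step -> merged flow dictionary."""
    nodes = [run_task_agent(task, spec) for task in plan(spec)]
    return {
        "nodes": [{"id": n.id, "type": n.kind, "data": n.data} for n in nodes],
        "edges": connect(nodes),
    }


if __name__ == "__main__":
    flow = generate_flow("Compare two models on summarizing news articles")
    # The merged flow is plain data, so it serializes directly to JSON.
    print(json.dumps(flow, indent=2))
```

Keeping each stage as a separate function mirrors the separation of concerns in the caption: the planning step decides what nodes are needed, the per-task step fills in each node, and the connection step only deals with how nodes are wired together.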