ChainBuddy: An AI Agent System for Generating LLM Pipelines
Jingyue Zhang, Ian Arawjo
TL;DR
ChainBuddy addresses the blank-page problem in LLM evaluation by generating editable starter pipelines in ChainForge from a single prompt. It uses a multi-agent architecture (requirement gathering, planning, task-specific agents, and edge/layout) to produce inspectable pipelines and evaluators, an approach framed here as the AutoLLMOps paradigm. A within-subjects usability study finds that AI-assisted workflows reduce workload, boost confidence, and improve expert-rated pipeline quality. However, participants' self-assessments diverge from the expert ratings of their work, which raises concerns about over-reliance. The work contributes a practical design for AI-assisted workflow generation, discusses interaction design considerations for requirement elicitation, and outlines future enhancements (data import, editability, diverse templates) to further lower the blank-page barrier in LLM pipeline creation.
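The staged architecture named above (requirement gathering, planning, task-specific agents, edge/layout) can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption: the function names, the `Node` and `Flow` types, and the node kinds are hypothetical and are not ChainBuddy's or ChainForge's actual API. Each "agent" is a plain function standing in for what would be an LLM call in a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    kind: str
    config: dict = field(default_factory=dict)


@dataclass
class Flow:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (source_id, target_id) pairs


def gather_requirements(prompt: str) -> dict:
    # Hypothetical stand-in for a requirement-gathering agent.
    # A real agent might ask the user clarifying questions first.
    return {"goal": prompt.strip()}


def plan(requirements: dict) -> list:
    # Hypothetical stand-in for a planner agent: returns an ordered
    # list of node kinds. These kinds are assumed names for illustration.
    return ["input_data", "prompt", "evaluator", "visualization"]


def build_node(kind: str, index: int, requirements: dict) -> Node:
    # Hypothetical stand-in for a task-specific agent that configures one node.
    return Node(id=f"{kind}-{index}", kind=kind, config={"goal": requirements["goal"]})


def connect(nodes: list) -> list:
    # Hypothetical stand-in for the edge/layout step: chain nodes in plan order.
    return [(a.id, b.id) for a, b in zip(nodes, nodes[1:])]


def generate_flow(prompt: str) -> Flow:
    requirements = gather_requirements(prompt)
    steps = plan(requirements)
    nodes = [build_node(kind, i, requirements) for i, kind in enumerate(steps)]
    return Flow(nodes=nodes, edges=connect(nodes))


if __name__ == "__main__":
    flow = generate_flow("Compare how two prompts summarize news articles")
    for source, target in flow.edges:
        print(source, "->", target)
```

Keeping each stage as a separate function with plain data passed between them mirrors the emphasis on inspectable, editable output: the result is an ordinary structure that a user could review and modify rather than an opaque answer.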
Abstract
As large language models (LLMs) advance, their potential applications have grown significantly. However, it remains difficult to evaluate LLM behavior on user-defined tasks and to craft effective pipelines for doing so. Many users struggle with where to start, a difficulty often referred to as the "blank page problem." ChainBuddy, an AI workflow generation assistant built into the ChainForge platform, aims to tackle this issue. From a single prompt or chat, ChainBuddy generates a starter evaluative LLM pipeline in ChainForge aligned to the user's requirements. ChainBuddy offers a straightforward and user-friendly way to plan and evaluate LLM behavior, making the process less daunting and more accessible across a wide range of possible tasks and use cases. We report a within-subjects user study comparing ChainBuddy to the baseline interface. We find that when using AI assistance, participants reported a less demanding workload, felt more confident, and produced higher-quality pipelines for evaluating LLM behavior. However, we also uncover a mismatch between subjective and objective ratings of performance: participants rated their own success similarly across conditions, while independent experts rated participant workflows significantly higher with AI assistance. Drawing connections to the Dunning-Kruger effect, we derive design implications for future workflow generation assistants to mitigate the risk of over-reliance.
