AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts
Tongshuang Wu, Michael Terry, Carrie J. Cai
TL;DR
The paper proposes Chaining, a method that decomposes a complex task into sub-tasks, each handled by a separate LLM run whose output feeds the next, to improve transparency, controllability, and collaboration in human–AI interaction. It defines eight primitive operations, an interactive Chain interface, and a data-flow structure that exposes intermediate results for user editing and debugging. A 20-person within-subject user study with LaMDA shows significant gains in perceived transparency and control, and higher-quality outputs (~82% success), when using Chains versus a Sandbox baseline. Case studies illustrate broader applicability to visualization debugging and assisted text entry, and the discussion outlines future directions for building more flexible, prototype-friendly LLM-based workflows. Overall, the work demonstrates that task decomposition and visible intermediate steps can unlock latent LLM capabilities and enable rapid prototyping of AI-assisted applications without retraining.
Abstract
Although large language models (LLMs) have demonstrated impressive potential on simple tasks, their breadth of scope, lack of transparency, and insufficient controllability can make them less effective when assisting humans on more complex tasks. In response, we introduce the concept of Chaining LLM steps together, where the output of one step becomes the input for the next, thus aggregating the gains per step. We first define a set of LLM primitive operations useful for Chain construction, then present an interactive system where users can modify these Chains, along with their intermediate results, in a modular way. In a 20-person user study, we found that Chaining not only improved the quality of task outcomes, but also significantly enhanced system transparency, controllability, and sense of collaboration. Additionally, we saw that users developed new ways of interacting with LLMs through Chains: they leveraged sub-tasks to calibrate model expectations, compared and contrasted alternative strategies by observing parallel downstream effects, and debugged unexpected model outputs by "unit-testing" sub-components of a Chain. In two case studies, we further explore how LLM Chains may be used in future applications.
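The core idea in the abstract, where each step's output becomes the next step's input and intermediate results stay visible and editable, can be illustrated with a minimal Python sketch. This is not the authors' system: `Step`, `run_chain`, `echo_llm`, the step names, and the prompt templates are all hypothetical, and `echo_llm` is a deterministic stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One link in a chain (hypothetical structure, not from the paper)."""
    name: str
    prompt_template: str  # must contain the placeholder "{input}"


def run_chain(
    steps: List[Step],
    llm: Callable[[str], str],
    initial_input: str,
    overrides: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, str]]:
    """Run steps in order, feeding each output into the next step.

    Returns every (step name, output) pair so intermediate results can be
    inspected. `overrides` maps a step name to a user-edited output that
    replaces the model's output for that step before it flows downstream.
    """
    overrides = overrides or {}
    trace: List[Tuple[str, str]] = []
    current = initial_input
    for step in steps:
        if step.name in overrides:
            # A user edit takes the place of the model output for this step.
            output = overrides[step.name]
        else:
            prompt = step.prompt_template.format(input=current)
            output = llm(prompt)
        trace.append((step.name, output))
        current = output  # output of this step is the input to the next
    return trace


def echo_llm(prompt: str) -> str:
    """Stand-in for a model call (assumption): wraps the prompt in brackets."""
    return f"<{prompt}>"


# Hypothetical two-step chain: generate suggestions, then compose a rewrite.
chain = [
    Step("ideate", "Suggest improvements for: {input}"),
    Step("compose", "Write friendly feedback from: {input}"),
]

for name, output in run_chain(chain, echo_llm, "too many slides"):
    print(f"{name}: {output}")
```

Passing `overrides={"ideate": "use fewer slides"}` to `run_chain` mimics a user editing an intermediate result: the `compose` step then runs on the edited text rather than on the model's output, which is the kind of step-level "unit-testing" the abstract describes.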
