Table of Contents
Fetching ...

Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey

Matthieu Lin, Jenny Sheng, Andrew Zhao, Shenzhi Wang, Yang Yue, Victor Shea Jay Huang, Huan Liu, Jun Liu, Gao Huang, Yong-Jin Liu

TL;DR

This survey introduces scaffolded language models with language supervision, a semi-parametric paradigm where post-trained LMs are coupled with non-parametric variables such as prompts and tools updated through natural-language feedback. It distinguishes parametric training (weights) from non-parametric, language-space optimization, and surveys three strands: prompt optimization, experiential learning, and AutoDiff-style frameworks within agents and workflows. A unifying taxonomy of scaffolded LMs—focusing on mixed-autonomy settings where humans and AI share control—leads to a review of benchmarks and real-world deployments like Copilot-style assistants. The authors advocate streaming learning from language, continual non-parametric updates, and interpretable, human-in-the-loop adaptation as key advantages over traditional parametric training, while calling out limitations and future directions for scaling, efficiency, and multi-modal extensions.

Abstract

This survey organizes the intricate literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools. We view scaffolded LMs as semi-parametric models wherein we train non-parametric variables, including the prompt, tools, and scaffold's code. In particular, they interpret instructions, use tools, and receive feedback all in language. Recent works use an LM as an optimizer to interpret language supervision and update non-parametric variables according to intricate objectives. In this survey, we refer to this paradigm as training of scaffolded LMs with language supervision. A key feature of non-parametric training is the ability to learn from language. Parametric training excels in learning from demonstration (supervised learning), exploration (reinforcement learning), or observations (unsupervised learning), using well-defined loss functions. Language-based optimization enables rich, interpretable, and expressive objectives, while mitigating issues like catastrophic forgetting and supporting compatibility with closed-source models. Furthermore, agents are increasingly deployed as co-workers in real-world applications such as Copilot in Office tools or software development. In these mixed-autonomy settings, where control and decision-making are shared between human and AI, users point out errors or suggest corrections. Accordingly, we discuss agents that continuously improve by learning from this real-time, language-based feedback and refer to this setting as streaming learning from language supervision.

Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey

TL;DR

This survey introduces scaffolded language models with language supervision, a semi-parametric paradigm where post-trained LMs are coupled with non-parametric variables such as prompts and tools updated through natural-language feedback. It distinguishes parametric training (weights) from non-parametric, language-space optimization, and surveys three strands: prompt optimization, experiential learning, and AutoDiff-style frameworks within agents and workflows. A unifying taxonomy of scaffolded LMs—focusing on mixed-autonomy settings where humans and AI share control—leads to a review of benchmarks and real-world deployments like Copilot-style assistants. The authors advocate streaming learning from language, continual non-parametric updates, and interpretable, human-in-the-loop adaptation as key advantages over traditional parametric training, while calling out limitations and future directions for scaling, efficiency, and multi-modal extensions.

Abstract

This survey organizes the intricate literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools. We view scaffolded LMs as semi-parametric models wherein we train non-parametric variables, including the prompt, tools, and scaffold's code. In particular, they interpret instructions, use tools, and receive feedback all in language. Recent works use an LM as an optimizer to interpret language supervision and update non-parametric variables according to intricate objectives. In this survey, we refer to this paradigm as training of scaffolded LMs with language supervision. A key feature of non-parametric training is the ability to learn from language. Parametric training excels in learning from demonstration (supervised learning), exploration (reinforcement learning), or observations (unsupervised learning), using well-defined loss functions. Language-based optimization enables rich, interpretable, and expressive objectives, while mitigating issues like catastrophic forgetting and supporting compatibility with closed-source models. Furthermore, agents are increasingly deployed as co-workers in real-world applications such as Copilot in Office tools or software development. In these mixed-autonomy settings, where control and decision-making are shared between human and AI, users point out errors or suggest corrections. Accordingly, we discuss agents that continuously improve by learning from this real-time, language-based feedback and refer to this setting as streaming learning from language supervision.

Paper Structure

This paper contains 28 sections, 3 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Non-parametric training of scaffolded LMs focuses on learning from language. Parametric training has excelled in learning from demonstration (supervised learning), exploration (reinforcement learning), and observation (unsupervised learning) by using well-defined loss functions. Instead, non-parametric training enables using an LM to optimize in the language space, allowing for efficient and interpretable learning with fuzzy objectives and rich textual feedback. As scaffolded LMs interact with digital systems operated by humans, mixed autonomy becomes increasingly important mixedautonomy. We anticipate that learning from language will serve as a crucial framework for scaffolded LMs that inhabit streams of experience from rich user feedback in mixed autonomy silver2025era. Importantly, it circumvents catastrophic forgetting, supports compatibility with closed-source models, and is interpretable Hadsell2020. We refer readers to Sec. \ref{['sec:beyondiid']} for further discussion.
  • Figure 2: Organization of this survey. This survey introduces a unifying paradigm centered on learning from language using the non-parametric variables of a scaffolded language model. Despite the complexity and fragmentation of the literature across domains such as language agents, prompt optimization, experiential learning, and compound AI systems, we adopt a deliberately simple and clear organizational structure. This simplicity is not a limitation, but a reflection of our effort to distill a coherent framework from a rapidly evolving field. It enables us to outline key research opportunities in learning with non-parametric variables in mixed-autonomy settings.
  • Figure 3: Glossary of core terms.
  • Figure 4: Example of chat history in LLaMA-3. Adapted from grattafiori2024llama3herdmodels. A chat history starts with a prompt comprising a system and user message in LLaMA-3. The system message enumerates available tools (as JSON schemas) and sets high-level behavior. The user message specifies the task and the data. During the multi-step process, the chat history is updated with assistant messages (LM responses), user feedback, and tool responses.
  • Figure 5: Multi-step process of a scaffolded LM. The distinctive trait between agent and workflow is that for each workflow step, a different prompt is written to pre-define the sub-task at that step. This makes workflows much more efficient and predictable than agents. However, workflows are engineered for specific scenarios. In this figure, we assume the case where the tool execution happens on the developer side. For built-in tools, tool execution happens on the provider's side openai2025agents.
  • ...and 7 more figures