Aviary: training language agents on challenging scientific tasks

Siddharth Narayanan; James D. Braza; Ryan-Rhys Griffiths; Manu Ponnapati; Albert Bou; Jon Laurent; Ori Kabeli; Geemi Wellawatte; Sam Cox; Samuel G. Rodriques; Andrew D. White

Abstract

Solving complex real-world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non-standard components such as internal reasoning, planning, and tool usage, as well as the inherent stochasticity of temperature-sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language-grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi-step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference-time compute, we show that language agents backed by open-source, non-frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.
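The action-observation cycle described in the abstract can be illustrated with a minimal sketch. This is not Aviary's API: `CounterEnv`, `scripted_policy`, and `rollout` are hypothetical names chosen for illustration, the policy is a fixed rule standing in for a temperature-sampled LLM, and the reward scheme is an assumption. The sketch only shows the general shape of a policy acting on text observations from an environment whose state it cannot see directly, under a step budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CounterEnv:
    """Hypothetical toy environment (not part of Aviary).

    The hidden state is an integer counter. The agent only receives
    text observations and must call the "add" tool until the counter
    reaches the target.
    """

    target: int = 3
    total: int = 0

    def reset(self) -> str:
        # Start a new episode and return the first text observation.
        self.total = 0
        return "Task: call the 'add' tool until told to stop."

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Apply the tool call, then return (observation, reward, done).
        if action == "add":
            self.total += 1
        done = self.total >= self.target
        reward = 1.0 if done else 0.0
        observation = "Stop, target reached." if done else "Keep going."
        return observation, reward, done


def scripted_policy(observation: str) -> str:
    # Stand-in for an LLM-backed policy: maps an observation to an action.
    return "add"


def rollout(
    env: CounterEnv,
    policy: Callable[[str], str],
    max_steps: int = 10,
) -> List[Tuple[str, str, float]]:
    """Run one episode and return (observation, action, reward) triples."""
    observation = env.reset()
    trajectory: List[Tuple[str, str, float]] = []
    for _ in range(max_steps):
        action = policy(observation)
        next_observation, reward, done = env.step(action)
        trajectory.append((observation, action, reward))
        observation = next_observation
        if done:
            break
    return trajectory


if __name__ == "__main__":
    for obs, act, rew in rollout(CounterEnv(target=3), scripted_policy):
        print(f"{obs!r} -> {act!r} (reward={rew})")
```

Trajectories collected this way are the kind of data that training methods such as behavior cloning and expert iteration would consume; the `max_steps` default of 10 mirrors the step limit mentioned in the Figure 3 caption, but is otherwise arbitrary.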

Paper Structure (25 sections, 4 equations, 7 figures, 1 algorithm)

Introduction
Related Work
Language Agent Formalisms
Language Agent Optimization Frameworks
Language Agent Benchmarks
Theory
Language Decision Processes
Stochastic Computation Graphs
Training Methods
Behavior cloning (BC)
Expert iteration (EI)
Inference Compute Scaling
Environments
GSM8K
hotpotQA
...and 10 more sections

Figures (7)

Figure 1: An overview of the five implemented Aviary environments and the language decision process (LDP) framework. The term language decision process here jointly refers to our theoretical description of the class of problems solved by language agents, as well as a software framework for implementing language agents based on a stochastic computation graph that enables training of language agent components such as LLM weights, prompts, memories, or LLM sampling parameters like temperature.
Figure 2: Simple language agent architectures represented as stochastic computation graphs. Deterministic nodes are solid rectangles; stochastic nodes are dashed. Note that we augment the graphs with a deterministic input node to indicate how the observation $o_t$ is consumed.
Figure 3: Ability of LLMs and language agents to solve tasks using Aviary environments. All LDP-trained agents are optimized using behavior cloning and expert iteration. For GSM8K and hotpotQA, EI is performed on gpt-4o; SeqQA and LitQA2 use Llama-3.1-8B-Instruct (see Figure 4). The difference between the GSM8K zero-shot accuracy reported here (89%) and the Anthropic benchmark [anthropic2024claude3] (96.5%) is likely due to Anthropic's use of chain-of-thought prompting, which we did not use. All agents are rolled out on the environment for a maximum of 10 steps, with the exception of PaperQA, which is allowed up to 18 steps.
Figure 4: Training language agents to solve (A) SeqQA tasks using the molecular cloning environment and (B) LitQA2 problems using the PaperQA environment. The first epoch represents behavior cloning, followed by expert iteration. These experiments use multiple workers to asynchronously collect trajectories and train the model, so GPU-hours measures the total time spent sampling and training. An untrained Llama-3.1-8B-Instruct agent solves 1% of SeqQA tasks, so we omit the data point at GPU-hours=0 in panel A.
Figure 5: Patterns of tool calls across trajectories. Colored boxes represent different tool categories, and edges between boxes represent consecutive actions taken in a trajectory, with darker edges implying more trajectories. In panel (A), we show the demonstration trajectories used for behavior cloning. Panels (B) and (C) show the trajectories sampled from the Llama-3.1-8B EI agent after expert iteration.
...and 2 more figures
