Training a Generally Curious Agent

Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J Zico Kolter, Jeff Schneider, Ruslan Salakhutdinov

TL;DR

Paprika presents a scalable fine-tuning framework that endows LLMs with general in-context sequential decision-making by training on diverse synthetic task groups that require strategic information gathering. The method combines SFT, multi-turn DPO, and a curriculum-driven data-sampling loop to maximize learning from trajectories while mitigating data-collection costs. Empirical results show robust zero-shot transfer to unseen task groups, improved data efficiency with curriculum learning, and no degradation on standard benchmarks, suggesting a practical path toward autonomous agents capable of solving novel sequential decision problems. These findings highlight the potential of amortized exploration and in-context RL as a route to general-purpose decision-making when interacting with the external world.

Abstract

Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
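The curriculum idea in the abstract, spending the sampling budget on tasks with high learning potential, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not define "learning potential", so the sketch uses the Bernoulli variance of a task's current success rate as a stand-in (it is largest for tasks solved about half the time), and the names `learning_potential` and `sample_tasks` are hypothetical rather than taken from the paper.

```python
import random


def learning_potential(success_rate: float) -> float:
    """Hypothetical proxy (an assumption, not the paper's definition):
    Bernoulli variance p * (1 - p), which peaks at p = 0.5 and is zero
    for tasks that are always or never solved."""
    return success_rate * (1.0 - success_rate)


def sample_tasks(success_rates: dict[str, float], num_samples: int, seed: int = 0) -> list[str]:
    """Draw task names with probability proportional to the proxy score."""
    rng = random.Random(seed)
    tasks = list(success_rates)
    weights = [learning_potential(success_rates[t]) for t in tasks]
    if sum(weights) == 0:
        # No task carries signal under this proxy: fall back to uniform sampling.
        return [rng.choice(tasks) for _ in range(num_samples)]
    return rng.choices(tasks, weights=weights, k=num_samples)


if __name__ == "__main__":
    # Illustrative success rates, not results from the paper.
    rates = {"easy": 0.95, "medium": 0.50, "hard": 0.05}
    print(sample_tasks(rates, num_samples=10))
```

Under this proxy, tasks the model always solves or never solves receive zero weight, so trajectories are collected mostly from tasks where outcomes still vary between attempts.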

Paper Structure

This paper contains 74 sections, 5 equations, 19 figures, 8 tables, 1 algorithm.

Figures (19)

  • Figure 1: (Overview of Paprika) We design a diverse set of tasks where an LLM agent needs strategic information gathering to succeed, then train an LLM on self-generated data to prefer higher performing trajectories. The resulting behavior learned by Paprika can transfer zero-shot to unseen tasks, showcasing its potential to build general decision making agents.
  • Figure 2: (Paprika improves success rate on a diverse range of task groups) Average success rate on all 10 task groups at temperature 0.7. Paprika generally improves performance of both Llama-3.1-8B-Instruct and Gemma-3-12B-IT models.
  • Figure 3: (Testing generalization of Paprika via leave-one-out and single task group experiments) We test Paprika's zero-shot performance on unseen task groups by leave-one-out (LOO) experiments, where we train the LLM on every task group except the group we test on. We also report the performance of Paprika (Single Task Group), where we train and test the LLM on a single group. Our experiments demonstrate that Paprika can teach an LLM decision making abilities that often transfer well to new tasks without any additional training, and the model also generally learns better in-group strategies when it observes trajectories from other task groups.
  • Figure 4: (Multi-round training with curriculum on twenty questions) We demonstrate the efficacy of our curriculum learning algorithm for sampling training tasks by comparing its performance against uniform sampling for multi-round training. All experiments use Llama-3.1-8B-Instruct as the initial model, evaluations are done at temperature 0.7, and shaded regions represent standard error over 3 seeds. (Left) Average success rate at each round. (Middle) Pass@4 success rate at each round. (Right) Success rate for each of the easy, medium, and hard task groups. Overall, our curriculum learning algorithm shows 1.4% and 3.3% improvements over the uniform sampling baseline in average and pass@4 success rate, respectively.
  • Figure 5: (Paprika improves success rate (pass@4)) Pass@4 success rate of Paprika-finetuned Llama-3.1-8B-Instruct vs. other models, evaluated at temperatures 0.3, 0.7, and 1.0. Paprika, when trained on trajectories from all task groups, shows significant improvement across all of them. We also compare against a Llama-3.1-8B-Instruct model finetuned on 100,000 trajectories randomly sampled from the WildChat dataset. This model performs poorly on all tasks, possibly due to model collapse.
  • ...and 14 more figures
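
Several captions above report a pass@4 success rate. As a rough illustration of how such a metric is commonly computed, the sketch below uses the standard combinatorial pass@k estimator: given n sampled attempts on a task of which c succeeded, it estimates the probability that at least one of k attempts drawn without replacement succeeds. Whether the paper uses this estimator or simply draws exactly four attempts per task is not stated in this section, so treat the function `pass_at_k` and the sample counts as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts with c successes.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k attempts drawn without replacement from the n are all failures.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative counts, not results from the paper.
    print(pass_at_k(n=4, c=1, k=4))  # exactly four attempts, one success
    print(pass_at_k(n=4, c=1, k=1))  # reduces to the average success rate
```

With n = k = 4 the estimator reduces to checking whether any of the four attempts succeeded, and with k = 1 it reduces to the average success rate c / n.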