Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL

Xiaofeng Lin, Sirou Zhu, Yilei Chen, Mingyu Chen, Hejian Sang, Ioannis Paschalidis, Zhipeng Wang, Aldo Pacchiano, Xuezhou Zhang

TL;DR

This work tackles the challenge of online decision-making with LLMs by introducing ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction within the context window. After meta-training, a relatively small model (Qwen3-14B) exhibits strong in-context online learning on unseen tasks, matching GPT-5.2 and outperforming standard RL fine-tuning, with consistent gains as model size grows. The method uses trajectory-based rewards and Group Relative Policy Optimization to encourage task completion and cross-episode adaptation without weight updates. The results suggest substantial headroom for learn-at-inference decision-making agents and highlight ORBIT as a scalable pathway toward general-purpose online agents; code is available at the project repository.

Abstract

Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.
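
The TL;DR names Group Relative Policy Optimization (GRPO) with trajectory-based rewards. The sketch below illustrates only the group-relative advantage step, assuming one scalar reward per sampled trajectory and the commonly used mean/standard-deviation normalization within a group. The paper's exact reward design and normalization are not stated in this section, so the function name, the `eps` parameter, and the example rewards are illustrative assumptions rather than the authors' implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize trajectory-level rewards within one sampled group.

    Assumption: `rewards` holds one scalar per trajectory, all sampled
    for the same task prompt. `eps` is a hypothetical stabilizer that
    avoids division by zero when every reward in the group is equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four trajectories: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Trajectories that beat the group average receive positive advantages and the rest receive negative ones, so no separate learned value function is needed to form the baseline.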

Paper Structure (34 sections, 15 equations, 6 figures, 4 tables, 1 algorithm)

This paper contains 34 sections, 15 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Left: Orbit gains over the base model on episode-3 success rate for Maze and Mastermind (w/ Orbit vs. w/o Orbit). Right: Average success rate across Maze and Mastermind over episodes for Orbit, GPT-4o, GPT-5.2 (high reasoning effort), and the oracle algorithms for the test environment (details in the appendix on oracle algorithms). Both tasks are unseen during training.
  • Figure 2: Example of Multi-Episode Meta-RL for Enterprise Tool-Use. (Top) Standard RL typically treats episodes as isolated events with state resets (tabula rasa), preventing the transfer of learned environmental constraints (e.g., API schemas, rate limits) between trials. (Bottom) A Meta-RL framework enables in-context adaptation across episodes. The agent accumulates a persistent interaction history, allowing it to transition from probing unknown tools (Ep 1) to refining its strategy based on errors (Ep 2), and finally exploiting the learned mental model (Ep 3) for reliable execution, all without requiring weight updates.
  • Figure 3: Orbit induces genuine in-context learning beyond single-episode and multitask baselines. Success rate versus episode index on unseen Maze (left) and Mastermind (right) tasks. We compare Orbit (8B) against the base model and RL baseline.
  • Figure 4: A trace of Orbit in Maze. The agent observes only its local surroundings and must rely on its history to guide decisions. The first two episodes (Ep 1 and Ep 2) fail to reach the goal, exploring suboptimal paths. Leveraging reflections over past interactions, the agent adapts its strategy and, in the third episode (Ep 3), follows a new route that successfully reaches the goal.
  • Figure 5: In-context learning across episodes on unseen test tasks. Success rate versus episode index for Maze (left) and Mastermind (right) when evaluating Qwen3-4B, 8B, 14B trained with Orbit. We report the best-performing checkpoint in terms of success rate of episode 3 within the first 100 training steps to highlight Orbit’s potential.
  • ...and 1 more figure
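
The cross-episode pattern described in the Figure 2 and Figure 4 captions (probe, refine, then exploit, with a history that persists across episodes and no weight updates) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions: `HiddenCodeEnv`, `history_aware_policy`, and `run_trial` are hypothetical names, the environment is a one-guess stand-in rather than the paper's Maze or Mastermind, and the hand-written policy stands in for an LLM reading its context window.

```python
class HiddenCodeEnv:
    """Toy stand-in task: one guess per episode at a hidden option index."""

    def __init__(self, secret):
        self.secret = secret

    def step(self, guess):
        # Returns True when the episode succeeds.
        return guess == self.secret


def history_aware_policy(history, n_options):
    """Pick an action using only the accumulated (guess, success) history."""
    for guess, success in history:
        if success:
            return guess  # exploit a guess that already worked
    tried = {guess for guess, _ in history}
    for guess in range(n_options):
        if guess not in tried:
            return guess  # explore an option not yet ruled out
    return 0


def run_trial(env, policy, n_episodes, n_options):
    """Run several episodes on one task, keeping history across episodes."""
    history = []  # persists across episodes; no parameters are updated
    outcomes = []
    for _ in range(n_episodes):
        guess = policy(history, n_options)
        success = env.step(guess)
        history.append((guess, success))
        outcomes.append(success)
    return outcomes


outcomes = run_trial(HiddenCodeEnv(secret=2), history_aware_policy,
                     n_episodes=3, n_options=3)
print(outcomes)  # [False, False, True]
```

Resetting `history` at the start of each episode would reproduce the isolated-episode setting that the Figure 2 caption contrasts with, in which nothing learned in one episode carries over to the next.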