Online Experiential Learning for Language Models

Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei

Abstract

The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.

Paper Structure (30 sections, 4 equations, 15 figures, 2 tables, 1 algorithm)

Figures (15)

  • Figure 1: By iterating over experiential knowledge extraction and consolidation stages of OEL, the model can progressively improve pass rate and efficiency (measured by response length) on the environment, effectively achieving online learning.
  • Figure 2: Offline training vs. online experiential learning. Left: The prevailing offline paradigm trains models at the server side using human annotations (SFT) or simulated environments (RL), operating in a closed world with pre-constructed data. Right: Online experiential learning forms a virtuous cycle during deployment. The model interacts with real environments on the user side, and the resulting test-time experience is used to update the model on the server side, requiring no annotations, no simulated environments, and enabling open-world learning from text feedback.
  • Figure 3: Overview of OEL. On the user side, the model interacts with the real environment to collect multi-turn trajectories. On the server side, transferable experiential knowledge is first extracted from the collected trajectories, then consolidated into model weights via on-policy context distillation. During training, the model performs single-turn rollouts from partial rollout prefixes and is optimized to match a knowledge-conditioned teacher through reverse KL divergence, eliminating the need for user-side environment access. The entire process relies solely on textual environment feedback, requiring no reward model or verifiable reward.
  • Figure 4: By iterating over experiential knowledge extraction and consolidation stages of OEL, the model can progressively improve pass rate, achieving online learning.
  • Figure 5: Normalized response length across OEL rounds. Reasoning becomes more efficient as experiential knowledge is progressively internalized.
  • ...and 10 more figures
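
The Figure 3 caption says the model is optimized to match a knowledge-conditioned teacher through reverse KL divergence. As a rough illustration only, the sketch below computes a per-token reverse KL, KL(student || teacher), from raw logits and averages it over a sequence. The function names, the use of plain Python lists, and the averaging over positions are assumptions made for this example; the paper's actual objective, including how the teacher is conditioned on experiential knowledge and how rollouts are sampled, may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position.

    The expectation is taken under the student's distribution, which is
    what makes this the "reverse" direction in a distillation setting.
    """
    log_p = log_softmax(student_logits)  # student (no knowledge in context)
    log_q = log_softmax(teacher_logits)  # teacher (assumed knowledge-conditioned)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_reverse_kl(student_seq, teacher_seq):
    """Mean per-token reverse KL over a sequence (hypothetical reduction)."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary at two positions (made-up values).
    student = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
    teacher = [[1.5, 1.0, -0.5], [0.0, 0.0, 0.0]]
    print(sequence_reverse_kl(student, teacher))
```

In a real training setup these logits would come from two forward passes over the same student-sampled tokens, one without and one with the extracted knowledge in context, and the loss would be minimized with respect to the student's parameters only.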