O3D: Offline Data-driven Discovery and Distillation for Sequential Decision-Making with Large Language Models

Yuchen Xiao, Yanchao Sun, Mengda Xu, Udari Madhushani, Jared Vann, Deepeka Garg, Sumitra Ganesh

TL;DR

O3D tackles the challenge of using large language models for long-horizon sequential decision-making by proposing an offline data-driven framework that discovers reusable skills and distills cross-task knowledge without finetuning. Through a three-stage pipeline (offline skill discovery, distillation of primitives and improvement tips, and hierarchical policy execution), O3D leverages large-scale offline interaction data to improve LLM-powered policies across multiple tasks. Empirical results on ALFWorld and WebShop show consistent improvements over strong baselines such as ReAct and Reflexion, with additional gains when extending to code-based policies (O3D-Code). The work demonstrates that offline data can unlock cross-task generalization and efficiency for LLM-driven agents, offering a practical path to scalable sequential decision-making without costly online training.

Abstract

Recent advancements in large language models (LLMs) have exhibited promising performance in solving sequential decision-making problems. By imitating few-shot examples provided in the prompts (i.e., in-context learning), an LLM agent can interact with an external environment and complete given tasks without additional training. However, such few-shot examples are often insufficient to generate high-quality solutions for complex and long-horizon tasks, while the limited context length cannot consume larger-scale demonstrations with long interaction horizons. To this end, we propose an offline learning framework that utilizes offline data at scale (e.g., logs of human interactions) to improve LLM-powered policies without finetuning. The proposed method O3D (Offline Data-driven Discovery and Distillation) automatically discovers reusable skills and distills generalizable knowledge across multiple tasks based on offline interaction data, advancing the capability of solving downstream tasks. Empirical results under two interactive decision-making benchmarks (ALFWorld and WebShop) verify that O3D can notably enhance the decision-making capabilities of LLMs through the offline discovery and distillation process, and consistently outperform baselines across various LLMs.
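
To make the three stages named in the TL;DR concrete, here is a minimal Python sketch of how skill discovery, distillation, and hierarchical execution could fit together. It is an illustration under stated assumptions, not the authors' implementation: the log format, the `Skill` record, the function names (`discover_skills`, `distill`, `act`), the prompt wording, and the stubbed `stub_llm` and `label_skill` helpers are all hypothetical. In particular, skill labels come from a hand-written rule here, whereas the paper describes discovering skills automatically from offline data.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]        # (observation, action); a hypothetical log format
LLM = Callable[[str], str]    # any prompt -> text function; stubbed below


@dataclass
class Skill:
    name: str
    good: List[List[Step]] = field(default_factory=list)  # segments from successful logs
    bad: List[List[Step]] = field(default_factory=list)   # segments from failed logs
    primitives: List[str] = field(default_factory=list)
    tips: str = ""


def discover_skills(logs: List[Tuple[List[Step], bool]],
                    label_skill: Callable[[Step], str]) -> Dict[str, Skill]:
    """Stage 1: split each trajectory into runs of steps sharing a skill label."""
    skills: Dict[str, Skill] = {}
    for steps, success in logs:
        for name, run in groupby(steps, key=label_skill):
            skill = skills.setdefault(name, Skill(name))
            (skill.good if success else skill.bad).append(list(run))
    return skills


def distill(skills: Dict[str, Skill], llm: LLM) -> None:
    """Stage 2: collect primitives and ask the LLM for tips, one skill at a time."""
    for skill in skills.values():
        # Assumption: the first word of an action names its primitive.
        verbs = {action.split()[0] for seg in skill.good for _, action in seg}
        skill.primitives = sorted(verbs)
        skill.tips = llm(f"Compare successful and failed attempts at '{skill.name}' "
                         f"and give tips.\nGOOD: {skill.good}\nBAD: {skill.bad}")


def act(observation: str, skills: Dict[str, Skill], llm: LLM) -> Tuple[str, str]:
    """Stage 3: a base policy picks a skill, then a skill-conditioned policy acts."""
    choice = llm(f"Pick one skill from {sorted(skills)} for: {observation}").strip()
    skill = skills[choice]
    prompt = (f"Skill: {skill.name}\nAllowed primitives: {skill.primitives}\n"
              f"Tips: {skill.tips}\nObservation: {observation}\nAction:")
    return choice, llm(prompt).strip()


def stub_llm(prompt: str) -> str:
    """Canned responses standing in for a real LLM call."""
    if prompt.startswith("Pick one skill"):
        return "find"
    if prompt.startswith("Compare"):
        return "Check closed containers before giving up."
    return "go to cabinet 1"


def label_skill(step: Step) -> str:
    """Hand-written stand-in for LLM-based skill labeling."""
    return "find" if step[1].startswith(("go", "open")) else "take"


logs = [
    ([("in kitchen", "go to drawer 1"),
      ("at drawer 1", "open drawer 1"),
      ("see mug", "take mug from drawer 1")], True),
    ([("in kitchen", "go to shelf 1"),
      ("nothing here", "take mug from shelf 1")], False),
]
skills = discover_skills(logs, label_skill)
distill(skills, stub_llm)
print(act("You are in the kitchen.", skills, stub_llm))
```

The offline stages run once over the logs, so only the short per-skill summary (primitives and tips) has to fit in the prompt at interaction time, rather than the full demonstrations. This mirrors the context-length concern raised in the abstract.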

Paper Structure (44 sections, 2 equations, 12 figures, 10 tables, 2 algorithms)

This paper contains 44 sections, 2 equations, 12 figures, 10 tables, 2 algorithms.

Figures (12)

  • Figure 1: The proposed O3D framework. Stage 1: offline skill discovery and data segmentation. Stage 2: offline policy improvement with knowledge distillation. Stage 3: downstream interaction with hierarchical policy execution.
  • Figure 2: Prompt examples for the base policy and a skill-conditioned policy in hierarchical policy execution.
  • Figure 3: Examples of discovered skills with primitives and distilled knowledge under ALFWorld.
  • Figure 4: Comparison on averaged success rate over GPT models between using tips distilled via contrastive and non-contrastive (NC) methods, with examples in green and pink boxes respectively.
  • Figure 5: Comparison on success rate (SR) with three variants of O3D against the baseline method.
  • ...and 7 more figures