ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

Bingqing Wei, Zhongyu Xia, Dingai Liu, Xiaoyu Zhou, Zhiwei Lin, Yongtao Wang

Abstract

Vision-language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them often fail at complex tasks: they skip critical steps, propose invalid actions, and repeat mistakes. These failures arise from a fundamental gap between the static training data of VLMs and the physical interaction that embodied tasks require. VLMs can learn rich semantic knowledge from static data but lack the ability to interact with the world. To address this issue, we introduce ELITE, an embodied agent framework with Experiential Learning and Intent-aware Transfer that enables agents to continuously learn from their own environment interaction experiences and transfer the acquired knowledge to procedurally similar tasks. ELITE operates through two synergistic mechanisms, i.e., self-reflective knowledge construction and intent-aware retrieval. Specifically, self-reflective knowledge construction extracts reusable strategies from execution trajectories and maintains an evolving strategy pool through structured refinement operations. Then, intent-aware retrieval identifies relevant strategies from the pool and applies them to current tasks. Experiments on the EB-ALFRED and EB-Habitat benchmarks show that ELITE achieves 9% and 5% performance improvements, respectively, over base VLMs in the online setting without any supervision. In the supervised setting, ELITE generalizes effectively to unseen task categories, outperforming state-of-the-art training-based methods. These results demonstrate the effectiveness of ELITE in bridging the gap between semantic understanding and reliable action execution.

Paper Structure (28 sections, 5 equations, 5 figures, 2 tables, 1 algorithm)

This paper contains 28 sections, 5 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the ELITE framework. The framework comprises two synergistic mechanisms: (1) Self-reflective Knowledge Construction, which distills reusable knowledge from execution trajectories and maintains the strategy pool through structured refinement operations; and (2) Intent-aware Retrieval, which embeds the coarse plan and retrieves procedurally similar strategies from an evolving strategy pool to augment task planning. This closed-loop architecture enables continuous improvement through physical interaction without supervision.
  • Figure 2: Ablation study on ELITE components in the online setting on EB-ALFRED. We compare the full ELITE framework against variants without Intent-Aware Retrieval (w/o IAR), without Context Consolidation (w/o CC), and the Base Model (Qwen2.5-VL-72B). The full ELITE achieves the best results across all task categories.
  • Figure 3: Ablation study on retrieval mechanisms in the online setting on EB-ALFRED long-horizon tasks. Task progress denotes the average completion percentage across all long-horizon tasks. Our intent-aware (CoT) retrieval outperforms TF-IDF-based alternatives (strategy content and instruction similarity), as well as using all strategies in the strategy pool and random selection.
  • Figure 4: Illustration of learning dynamics of ELITE in the online setting on EB-ALFRED long-horizon tasks. The x-axis indicates the number of tasks processed for online learning, while the y-axis shows the average success rate and task progress across all long-horizon tasks. Both metrics improve consistently as the strategy pool accumulates experience, demonstrating continuous self-improvement through deployment experience.
  • Figure 5: Qualitative comparison between the base Qwen2.5-VL model and ELITE on the example task: "Put a clean plate on the counter." Left: The base model generates a flawed plan that attempts to clean the plate without first placing it in the sink, resulting in task failure. Middle: ELITE retrieves procedurally similar strategies from past experiences (sim=0.80 and 0.75) that demonstrate the correct pattern of putting objects in the sink before cleaning. Right: Augmented with retrieved strategies, ELITE produces a corrected plan that properly places the plate in the sink before turning on the faucet, leading to successful task completion.
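
The retrieval step described in the captions of Figures 1 and 5 (embed the coarse plan, then fetch procedurally similar strategies from the pool) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Strategy`, `StrategyPool`, `bow`, and `cosine` are hypothetical, and a bag-of-words cosine similarity stands in for whatever plan embedding the paper actually uses. The example plans and advice strings are made up for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def bow(text: str) -> Counter:
    """Bag-of-words vector; an assumed stand-in for a learned plan embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Strategy:
    plan: str    # coarse plan the strategy was distilled from (hypothetical field)
    advice: str  # reusable guidance extracted from a trajectory (hypothetical field)


@dataclass
class StrategyPool:
    items: list = field(default_factory=list)

    def add(self, plan: str, advice: str) -> None:
        """Append a new strategy; refinement operations are omitted in this sketch."""
        self.items.append(Strategy(plan, advice))

    def retrieve(self, coarse_plan: str, k: int = 2, min_sim: float = 0.0):
        """Return up to k (similarity, strategy) pairs, most similar first."""
        query = bow(coarse_plan)
        scored = [(cosine(query, bow(s.plan)), s) for s in self.items]
        scored = [pair for pair in scored if pair[0] > min_sim]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


pool = StrategyPool()
pool.add(
    "pick up mug, put mug in sink, turn on faucet, put mug on table",
    "Place the object in the sink before turning on the faucet.",
)
pool.add(
    "find lamp, turn on lamp, pick up book",
    "Turn on the light before examining an object.",
)

# Coarse plan for a new task, deliberately missing the "put in sink" step.
for sim, strategy in pool.retrieve("pick up plate, turn on faucet, put plate on counter", k=1):
    print(f"sim={sim:.2f}: {strategy.advice}")
```

Matching on the coarse plan rather than on the instruction text is what lets the mug-cleaning strategy surface for a plate task: the two share procedural steps even though the objects differ. In a full system the retrieved advice would be added to the planner's prompt before the final plan is generated.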