VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought

Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki

TL;DR

ICAL introduces a multimodal, self-reflective learning paradigm for VLM agents that distills noisy trajectories into generalizable programs of thought. It separates abstraction into four types (task and causal relationships, object state changes, temporal subgoals, and task-relevant state representations) and refines them iteratively with human feedback, building a growing memory of high-quality examples for retrieval-augmented prompting and fine-tuning. Empirically, ICAL achieves state-of-the-art results on TEACh and VisualWebArena and remains competitive with supervised models on Ego4D, while reducing human effort and data requirements and making continual learning more efficient. The work highlights cross-domain applicability, scalability, and the potential for reduced expert supervision in embodied AI tasks, and it outlines limitations and avenues for future refinement.

Abstract

Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot learning but require high-quality demonstrations. We propose In-Context Abstraction Learning (ICAL), enabling VLM agents to transform suboptimal trajectories into high-quality training data through self-reflection and human feedback. Given imperfect task demonstrations, a VLM abstracts trajectories into generalized strategies and action annotations by correcting inefficiencies and annotating cognitive abstractions: causal relationships, object state changes, temporal subgoals, and task-relevant visual elements. These annotations are iteratively refined through human feedback during execution in similar environments. The resulting examples significantly improve decision-making when used for retrieval-augmented generation or fine-tuning. As the agent's example library grows, it becomes more efficient at abstracting new examples, requiring less human feedback and fewer environment interactions. ICAL achieves state-of-the-art results across multiple benchmarks. In TEACh dialogue-based instruction following, combining fine-tuning and retrieval on ICAL examples outperforms raw human demonstrations and expert examples by 17.5% in goal-condition success. In VisualWebArena, retrieval-augmented GPT-4V with ICAL improves task success 1.6x, while fine-tuned Qwen2-VL achieves 2.8x improvement over the base model. In Ego4D action forecasting, we surpass few-shot GPT-4V and remain competitive with supervised models. Our approach scales 2x better than raw demonstrations and significantly reduces manual prompt engineering requirements.
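The retrieval step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the names `AbstractedExample`, `ExampleMemory`, and `build_prompt` are hypothetical, and a toy bag-of-words cosine similarity stands in for whatever multimodal retrieval the paper actually uses. The sketch stores examples annotated with the four abstraction types, retrieves the most similar ones for a new instruction, and formats them as an in-context prompt.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AbstractedExample:
    """Hypothetical record: a revised action plan plus the four annotation types."""
    instruction: str
    actions: list[str]
    causal: str = ""
    state_changes: str = ""
    subgoals: str = ""
    visual_elements: str = ""


def bag_of_words(text: str) -> Counter:
    # Toy text representation; an assumption standing in for a learned embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class ExampleMemory:
    """Growing library of abstracted examples."""
    examples: list[AbstractedExample] = field(default_factory=list)

    def add(self, example: AbstractedExample) -> None:
        self.examples.append(example)

    def retrieve(self, instruction: str, k: int = 2) -> list[AbstractedExample]:
        # Rank stored examples by similarity of their instruction to the query.
        query = bag_of_words(instruction)
        ranked = sorted(
            self.examples,
            key=lambda ex: cosine(query, bag_of_words(ex.instruction)),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(instruction: str, retrieved: list[AbstractedExample]) -> str:
    # Format retrieved examples as in-context demonstrations, then the new task.
    blocks = []
    for i, ex in enumerate(retrieved, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Instruction: {ex.instruction}\n"
            f"Causal abstraction: {ex.causal}\n"
            f"State changes: {ex.state_changes}\n"
            f"Subgoals: {ex.subgoals}\n"
            f"Relevant visual elements: {ex.visual_elements}\n"
            f"Actions: {'; '.join(ex.actions)}"
        )
    blocks.append(f"New task\nInstruction: {instruction}\nActions:")
    return "\n\n".join(blocks)


memory = ExampleMemory()
memory.add(AbstractedExample(
    instruction="make coffee in a clean mug",
    actions=["pick up mug", "rinse mug in sink", "place mug in coffee machine", "toggle on"],
    causal="a dirty mug must be rinsed before it is used",
    state_changes="mug: dirty -> clean -> filled",
    subgoals="clean the mug, then brew",
    visual_elements="mug, sink, coffee machine",
))
memory.add(AbstractedExample(
    instruction="slice the tomato and put it on a plate",
    actions=["pick up knife", "slice tomato", "place slice on plate"],
))
memory.add(AbstractedExample(
    instruction="water the plant",
    actions=["fill cup with water", "pour water on plant"],
))

top = memory.retrieve("make a cup of coffee", k=1)
prompt = build_prompt("make a cup of coffee", top)
print(prompt)
```

In this toy setup the coffee example ranks first for the query because it shares the most words with it. A real agent would pass the resulting prompt, together with the current observation, to a VLM.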

Paper Structure (57 sections, 14 equations, 9 figures, 9 tables, 2 algorithms)


Figures (9)

  • Figure 1: ICAL transforms raw experience into useful programs of thought for in-context learning. Top: Given a noisy trajectory, ICAL prompts a VLM to optimize actions and add language annotations. The optimized trajectory is executed, incorporating human feedback on failures. Successful examples are stored for future VLM in-context action generation. Bottom: An example of the raw, noisy trajectory (left), and the final abstracted example after ICAL (right).
  • Figure 2: After the ICAL examples have been learned, ICAL is deployed for new tasks and environments using retrieval-augmented generation.
  • Figure 3: ICAL enables greater success on training tasks. Tasks successfully completed by ICAL over number of interactions when using the ICAL method with kinesthetic or visual demonstrations, and when replaying the kinesthetic or visual demonstrations directly.
  • Figure 4: TEACh validation unseen success rate for ICAL with increasing number of exemplars. ICAL continually learns without forgetting, scaling 2x better than the unchanged visual human demos used to seed ICAL learning. ● denotes task success, while x denotes goal-condition success.
  • Figure 5: ICAL improves learning efficiency as more examples are added to memory. First half (blue) versus second half (orange) of ICAL learning across tasks (left) and for each task type separately (right) in TEACh. The second half of ICAL learning requires significantly fewer environment steps (436±88 vs. 267±43, p=0.0143) and fewer human feedback instances per episode (0.74±0.17 vs. 0.21±0.08, p=0.0089). This indicates that retrieving ICAL examples during learning is beneficial, reducing both human effort and environment interaction over time.
  • ...and 4 more figures
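
To put the Figure 5 numbers in perspective, the short sketch below computes the relative reduction from the first half to the second half of learning. It uses only the means reported in the caption; the helper name `relative_reduction` is introduced here for illustration, and the calculation ignores the reported standard errors.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# Means from the Figure 5 caption: first half vs. second half of ICAL learning.
env_steps = relative_reduction(436, 267)
feedback = relative_reduction(0.74, 0.21)

print(f"environment steps: {env_steps:.1%} fewer")
print(f"human feedback per episode: {feedback:.1%} fewer")
```

On these means, the second half uses roughly 39% fewer environment steps and roughly 72% less human feedback per episode.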