XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung

Abstract

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.

Paper Structure (43 sections, 5 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Comparison of Reasoning Trajectories on a Multimodal Task with and without XSkill. The baseline agent (left) fails due to visual-semantic gaps, neglecting to correct the inverted image or isolate small objects. In contrast, XSkill (right) recalls relevant experiences and links them to structured skill fragments. Through context-aware adaptation, the agent generates a grounded execution plan involving rotation and cropping, leading to successful identification.
  • Figure 2: Overview of the XSkill framework. Phase I (left): The agent accumulates knowledge by distilling structured skill documents (orange dataflow) and experience items (green dataflow) from multi-path trajectories through (A) Rollout Summary, (B) Cross-Rollout Critique, and hierarchical consolidation. Phase II (right): For a test task, the system (C) decomposes it into subtasks and retrieves relevant knowledge, (D) adapts it to the current visual context, and injects it into the prompt of the agent for execution.
  • Figure 3: Error analysis on VisualToolBench using Gemini-2.5-Pro. Error counts (inside bars) and their proportions relative to total tool calls (above bars) are compared across three settings. Skills significantly reduce syntax and runtime errors, leading to more robust tool execution.
  • Figure 4: Performance comparison across different rollout values on VisualToolBench. Rollout $N=0$ corresponds to the baseline with tools (w/ tools). The results show consistent improvement as the number of rollouts increases.
  • Figure 5: Out-of-distribution performance comparison (Average@4) of different methods on TIR-Bench and MMBrowseComp. The gray horizontal dashed line represents the w/ Tools baseline. Our method (highlighted with black border) consistently outperforms all baseline methods across both models and benchmarks.
  • ...and 1 more figure
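
The inference phase described in the Figure 2 caption (decompose the task, retrieve relevant knowledge, inject it into the prompt) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `decompose`, `retrieve`, and `build_prompt` are hypothetical, the rule-based split and the token-overlap scoring are simple stand-ins for whatever the paper actually uses, and the visually grounded adaptation step (D) is omitted because it would require a multimodal model call.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def decompose(task: str) -> list[str]:
    """Hypothetical stand-in for step (C): split a task on ';' or 'then'."""
    parts = re.split(r"\bthen\b|;", task)
    return [p.strip() for p in parts if p.strip()]


def retrieve(subtask: str, bank: list[str], k: int = 1) -> list[str]:
    """Return up to k bank items sharing at least one token with the subtask.

    Token overlap is an assumption for this sketch, not the paper's retriever.
    """
    query = tokens(subtask)
    ranked = sorted(bank, key=lambda item: len(query & tokens(item)), reverse=True)
    return [item for item in ranked[:k] if query & tokens(item)]


def build_prompt(task: str, bank: list[str]) -> str:
    """Inject retrieved knowledge into the agent prompt (adaptation omitted)."""
    selected: list[str] = []
    for subtask in decompose(task):
        for item in retrieve(subtask, bank):
            if item not in selected:  # avoid injecting the same item twice
                selected.append(item)
    lines = [f"Task: {task}", "Relevant knowledge:"]
    lines += [f"- {item}" for item in selected]
    return "\n".join(lines)


if __name__ == "__main__":
    knowledge = [
        "Rotate the image upright before reading text.",
        "Crop small objects before identifying them.",
    ]
    print(build_prompt(
        "rotate the inverted image; then crop the small object and identify it",
        knowledge,
    ))
```

Retrieving per subtask rather than per task mirrors the decomposition step in the caption: each subtask can pull in a different knowledge item, which a single query over the whole task might miss.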

Theorems & Definitions (2)

  • Definition 2.1: Skill
  • Definition 2.2: Experience
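
The outline lists only the titles of Definitions 2.1 and 2.2, so the sketch below is a guess at how the two knowledge forms from the abstract might be represented, not the paper's formal definitions. All class and field names (`Experience`, `Skill`, `KnowledgeBank`, `condition`, `usage_count`, and so on) are hypothetical. The only properties taken from the source are that experiences carry concise action-level guidance, that skills carry structured task-level guidance for planning and tool use, and that usage history is fed back into accumulation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Experience:
    """Concise action-level guidance (fields are assumptions)."""
    condition: str        # situation in which the guidance applies
    action: str           # recommended tool choice or decision
    usage_count: int = 0  # usage history that accumulation could consult


@dataclass
class Skill:
    """Structured task-level guidance (fields are assumptions)."""
    name: str
    steps: list[str] = field(default_factory=list)  # ordered plan
    tools: list[str] = field(default_factory=list)  # tools the plan relies on


@dataclass
class KnowledgeBank:
    """Holds both streams; exact-match consolidation is a simplification."""
    experiences: list[Experience] = field(default_factory=list)
    skills: dict[str, Skill] = field(default_factory=dict)

    def add_experience(self, exp: Experience) -> Experience:
        """Store an experience, returning the existing copy if it is a duplicate."""
        for known in self.experiences:
            if (known.condition, known.action) == (exp.condition, exp.action):
                return known
        self.experiences.append(exp)
        return exp

    def add_skill(self, skill: Skill) -> None:
        """Store a skill document, replacing any older version with the same name."""
        self.skills[skill.name] = skill

    def record_use(self, exp: Experience) -> None:
        """Log that an experience was retrieved during inference."""
        exp.usage_count += 1


if __name__ == "__main__":
    bank = KnowledgeBank()
    tip = bank.add_experience(Experience("image is inverted", "rotate before reading"))
    bank.add_skill(Skill("inspect small object", ["rotate", "crop", "identify"], ["image_tool"]))
    bank.record_use(tip)
    print(tip.usage_count, list(bank.skills))
```

Keeping the two streams as separate types reflects the abstract's distinction between action-level and task-level guidance; how the paper consolidates or links them is not specified in this outline.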