GEMS: Agent-Native Multimodal Generation with Memory and Skills

Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang

Abstract

Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to handle diverse downstream applications effectively. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of an agent harness in extending model capabilities beyond their original limits.
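To make the interplay of the three components concrete, the following is a minimal Python sketch of a closed generate-verify-refine loop with a memory and on-demand skills. It is an illustration under stated assumptions, not the authors' implementation: every name in it (`AgentMemory`, `load_skills`, `run_loop`, the `toy_*` helpers) is hypothetical, the "generator" simply echoes its prompt, and the "verifier" is a substring check standing in for whatever model-based verification the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MemoryEntry:
    iteration: int
    prompt: str        # factual state: the prompt used in this round
    failed: List[str]  # factual state: criteria the output missed
    summary: str       # compressed note on what this round showed


@dataclass
class AgentMemory:
    """Hypothetical trajectory-level memory: full states plus short summaries."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def record(self, iteration: int, prompt: str, failed: List[str]) -> None:
        summary = "all criteria passed" if not failed else "missing: " + ", ".join(failed)
        self.entries.append(MemoryEntry(iteration, prompt, list(failed), summary))

    def context(self) -> str:
        # Only the compressed summaries are fed back to the refiner.
        return " | ".join(f"round {e.iteration}: {e.summary}" for e in self.entries)


def load_skills(prompt: str, skills: Dict[str, str]) -> List[str]:
    # On-demand loading (assumed keyword match): return guidance only for
    # skills whose key appears in the prompt.
    lowered = prompt.lower()
    return [text for key, text in skills.items() if key in lowered]


def run_loop(
    prompt: str,
    skills: Dict[str, str],
    generate: Callable[[str], str],
    verify: Callable[[str], List[str]],
    refine: Callable[[str, List[str], str], str],
    max_iters: int = 4,
) -> Tuple[str, AgentMemory]:
    memory = AgentMemory()
    guidance = load_skills(prompt, skills)
    current = prompt if not guidance else prompt + " [" + "; ".join(guidance) + "]"
    output = ""
    for i in range(max_iters):
        output = generate(current)      # call the generative backend
        failed = verify(output)         # list of unmet criteria
        memory.record(i, current, failed)
        if not failed:                  # stop once every criterion passes
            break
        current = refine(current, failed, memory.context())
    return output, memory


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return prompt  # the "image" is just the prompt text


def make_toy_verify(criteria: List[str]) -> Callable[[str], List[str]]:
    return lambda output: [c for c in criteria if c not in output]


def toy_refine(prompt: str, failed: List[str], context: str) -> str:
    return prompt + ", " + ", ".join(failed)  # append whatever was missing
```

In a real system the three callables would wrap a generation backend and model-based verifier and refiner; passing them in as parameters keeps the loop itself independent of any particular backend.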

Paper Structure

This paper contains 33 sections, 8 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Overall Performance of GEMS with Z-Image-Turbo. GEMS enables a lightweight, distilled 6B model Z-Image-Turbo to outperform prominent closed-source models, such as Nano Banana and GPT-Image, across various mainstream benchmarks.
  • Figure 2: The system architecture of GEMS. The framework consists of three primary pillars: Agent Loop, Agent Memory, and Agent Skill. The user prompt is augmented with domain-specific expertise from Agent Skill, and then iteratively refined within the Agent Loop, with Agent Memory managing the historical context to guide the generation process.
  • Figure 3: Architecture of the Agent Skill system, highlighting its scalable and on-demand nature.
  • Figure 4: Ablation study on GenEval2 with Z-Image-Turbo. (Left) Performance gains contributed by individual components, including Agent Loop, Agent Memory, and Agent Skill. (Right) Detailed analysis of the performance improvements of Agent Memory.
  • Figure 5: Average passed criteria over iterations.
  • ...and 12 more figures