Infant Agent: A Tool-Integrated, Logic-Driven Agent with Cost-Effective API Usage

Bin Lei, Yuchen Li, Yiming Zeng, Tao Ren, Yi Luo, Tianyu Shi, Zitian Gao, Zeyu Hu, Weitai Kang, Qiuwu Chen

TL;DR

Infant Agent integrates task-aware functions, operators, a hierarchical management system, and a memory retrieval mechanism, enabling large language models to sustain extended reasoning processes and handle complex, multi-step tasks efficiently while significantly reducing API costs.

Abstract

Despite the impressive capabilities of large language models (LLMs), they currently exhibit two primary limitations. (I) They struggle to autonomously solve real-world engineering problems. (II) They remain challenged in reasoning through complex logic problems. To address these challenges, we developed the Infant Agent, integrating task-aware functions, operators, a hierarchical management system, and a memory retrieval mechanism. Together, these components enable large language models to sustain extended reasoning processes and handle complex, multi-step tasks efficiently, all while significantly reducing API costs. Using the Infant Agent, GPT-4o's accuracy on the SWE-bench-lite dataset rises from 0.33% to 30%, and on the AIME-2024 mathematics competition its accuracy increases from 13.3% to 37%.

Paper Structure

This paper contains 18 sections, 4 equations, 21 figures, 5 tables.

Figures (21)

  • Figure 1: Overall performance summary. Legend: real-world engineering problem; Infant Agent; OpenHands CodeActAgent; logic reasoning problem; API token cost; API token.
  • Figure 2: The overall architecture of Infant Agent. Legend: request from the user; brain-level agent; hand-level agent.
  • Figure 3: Hierarchical Agent Collaboration System. Legend: brain-level agent; file editor; browser; code agent; mouse/keyboard operation; music; data analysis.
  • Figure 4: Overview of Memory Storage, Retrieval, and Generation.
  • Figure 5: Differences between the file-editing commands of Infant-AI and SWE-Agent. Legend: original file content; command generated by the agent; final modified file content generated by SWE-Agent; final modified file content generated by Infant Agent; modification process of the file by Infant Agent; modification process of the file by SWE-Agent.
  • ...and 16 more figures