Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning

Yulong Wang, Tianhao Shen, Lifeng Liu, Jian Xie

TL;DR

The paper tackles the challenge of long-horizon real-world reasoning in LLM-based agents, which are often encumbered by design complexity. It introduces Sibyl, a simple, modular framework with a tool planner, an external information acquisition channel, a global workspace, and a multi-agent jury, emphasizing stateless QA inference and reentrancy for easier debugging and reuse. Evaluated on the GAIA benchmark with GPT-4o, Sibyl achieves a new state-of-the-art average score (34.55% on the test set) and significant improvements on Level 2/3 tasks, demonstrating enhanced long-term reasoning and robustness. The work highlights the value of workspace-based memory, collaborative self-refinement, and debug-friendly design for scalable, real-world AI agents.
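To make "stateless QA inference and reentrancy" concrete, here is a minimal sketch under stated assumptions, not Sibyl's actual code. Each step is treated as a pure function of the question and an immutable workspace, so any saved workspace snapshot can be fed back in to resume a run from that point. `Workspace`, `toy_step`, and `run` are hypothetical names, and `toy_step` is a deterministic stand-in for the LLM and tool calls a real agent would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Workspace:
    """Hypothetical global workspace: an immutable, append-only record of entries."""
    entries: Tuple[str, ...] = ()

    def append(self, entry: str) -> "Workspace":
        # Return a new workspace instead of mutating this one, so old snapshots stay valid.
        return Workspace(self.entries + (entry,))


# A step maps (question, workspace) to (new workspace, final answer or None).
Step = Callable[[str, Workspace], Tuple[Workspace, Optional[str]]]


def toy_step(question: str, ws: Workspace) -> Tuple[Workspace, Optional[str]]:
    # Hypothetical stand-in for an LLM/tool call: record one observation, answer after three.
    ws = ws.append(f"observation {len(ws.entries) + 1} for: {question}")
    if len(ws.entries) >= 3:
        return ws, f"answer after {len(ws.entries)} steps"
    return ws, None


def run(question: str, step: Step, ws: Workspace = Workspace(),
        max_steps: int = 10) -> Tuple[Optional[str], List[Workspace]]:
    # All state lives in `ws`; keeping every snapshot makes any step replayable.
    snapshots = [ws]
    for _ in range(max_steps):
        ws, answer = step(question, ws)
        snapshots.append(ws)
        if answer is not None:
            return answer, snapshots
    return None, snapshots


answer, snaps = run("Q", toy_step)
# Re-enter the run from the first intermediate snapshot instead of starting over.
replayed, replay_snaps = run("Q", toy_step, snaps[1])
print(answer, replayed == answer, replay_snaps[-1] == snaps[-1])  # prints: answer after 3 steps True True
```

Because no step hides state outside the workspace, a failing run can be debugged by reloading the snapshot just before the failure and re-running only the remaining steps; with a real, non-deterministic LLM the replay would match only if sampling is fixed.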

Abstract

Existing agents based on large language models (LLMs) demonstrate robust problem-solving capabilities by integrating LLMs' inherent knowledge, strong in-context learning and zero-shot capabilities, and the use of tools combined with LLM invocation workflows intricately designed by humans. However, these agents still exhibit shortcomings in long-term reasoning and under-use the potential of existing tools, leading to noticeable deficiencies in complex real-world reasoning scenarios. To address these limitations, we introduce Sibyl, a simple yet powerful LLM-based agent framework designed to tackle complex reasoning tasks by efficiently leveraging a minimal set of tools. Drawing inspiration from Global Workspace Theory, Sibyl incorporates a global workspace to enhance the management and sharing of knowledge and conversation history throughout the system. Furthermore, guided by Society of Mind Theory, Sibyl implements a multi-agent debate-based jury to self-refine the final answers, ensuring a comprehensive and balanced approach. This approach aims to reduce system complexity while expanding the scope of solvable problems, from matters typically resolved by humans in minutes to those requiring hours or even days, thus facilitating a shift from System-1 to System-2 thinking. Sibyl has been designed from its inception with a focus on scalability and ease of debugging by incorporating the concept of reentrancy from functional programming, with the aim of seamless, low-effort integration into other LLM applications to improve their capabilities. Our experimental results on the GAIA benchmark test set reveal that the Sibyl agent instantiated with GPT-4 achieves state-of-the-art performance with an average score of 34.55%, compared to other agents based on GPT-4. We hope that Sibyl can inspire more reliable and reusable LLM-based agent solutions to address complex real-world reasoning tasks.
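The abstract describes the jury only at a high level: several agents debate and self-refine the final answer. The sketch below illustrates one debate-then-vote pattern under explicit assumptions and is not the paper's implementation. `Juror`, `jury_refine`, and `follower` are hypothetical names, plain Python callables replace the LLM jurors, and majority vote is an aggregation rule chosen here for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical juror interface: (question, candidate answer, peers' last opinions) -> opinion.
Juror = Callable[[str, str, Sequence[str]], str]


def jury_refine(question: str, candidate: str, jurors: List[Juror], rounds: int = 2) -> str:
    opinions = [""] * len(jurors)  # no opinions exist before the first round
    for _ in range(rounds):
        # Debate round: each juror sees the other jurors' opinions from the previous round.
        opinions = [
            juror(question, candidate, [o for j, o in enumerate(opinions) if j != i])
            for i, juror in enumerate(jurors)
        ]
    # Aggregate by majority vote; on ties, Counter.most_common keeps the opinion seen first.
    return Counter(opinions).most_common(1)[0][0]


def follower(question: str, candidate: str, peers: Sequence[str]) -> str:
    # Hypothetical juror: adopts the first non-empty peer opinion, otherwise keeps the candidate.
    return next((p for p in peers if p), candidate)


jurors: List[Juror] = [
    lambda q, c, peers: c,       # always endorses the candidate
    lambda q, c, peers: "Lyon",  # always dissents
    follower,
]
verdict = jury_refine("What is the capital of France?", "Paris", jurors)
print(verdict)  # prints: Paris
```

Replacing each callable with a function that prompts a model with the question, the candidate, and its peers' opinions would give an LLM-backed jury; the number of rounds, the empty-opinion sentinel, and the voting rule are all design choices rather than details taken from the paper.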

Paper Structure (39 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The overall pipeline of Sibyl framework.
  • Figure 2: Steps used to solve questions for human and Sibyl agent on the GAIA validation set.
  • Figure 3: Average steps needed by human and Sibyl agent on the GAIA validation set.