DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning

Xiaochuan Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen

TL;DR

DataSage tackles automated data insight discovery by integrating three core innovations (external knowledge retrieval, divergent-convergent multi-role questioning, and multi-path reasoning for code generation) within a modular, multi-agent framework. It shows consistent performance gains over state-of-the-art baselines on InsightBench, especially on harder tasks, and yields higher-quality plots and more actionable insights. The RAKG component improves efficiency by retrieving knowledge on demand without sacrificing accuracy, while the combined questioning and reasoning components increase analytical depth and robustness. Overall, the work advances automated data analysis by making better use of domain knowledge and producing reliable, interpretable code-driven insights.

Abstract

In today's data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for surfacing actionable insights that help organizations make effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features: external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen the analysis, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.

Paper Structure

This paper contains 32 sections, 15 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Illustration of three critical limitations in current data insight agents: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation.
  • Figure 2: An illustration of DataSage, a multi-agent framework for insight discovery tasks. The framework consists of four key modules that work in an iterative QA loop. The Dataset Description Module provides a structured description of the dataset. The RAKG Module retrieves and synthesizes external knowledge. The Question Raising Module raises high-quality analytical questions through a divergent-convergent process. Finally, the Insights Generation Module transforms the questions into the final insights.
  • Figure 3: Comparison of plot quality across different models. DataSage generates noticeably higher-quality plots than the baseline AgentPoirot, which is attributed to the design of the Insights Generation Module.
  • Figure 4: Comparison of diversity and coverage of the questions raised by DataSage and the best baseline AgentPoirot. Even when generating fewer questions (6 vs. 12), DataSage produces questions with significantly higher diversity and coverage, enabling the generation of deeper and more diverse insights.
  • Figure 5: Additional ablation study results. Removing Multimodal Insight Interpretation (MII), Plot Reviewer (PR), or Code Refinement (CR) consistently degrades performance, confirming the necessity of these components.
  • ...and 2 more figures
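
The four-module iterative QA loop described for Figure 2 can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `LoopState`, `run_qa_loop`, and the four stub functions are hypothetical, the stubs return placeholder strings instead of calling an LLM or a retriever, and the order of steps inside one iteration is an assumption based on the caption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopState:
    """Context accumulated across iterations of the QA loop (hypothetical)."""
    description: str
    knowledge: List[str] = field(default_factory=list)
    questions: List[str] = field(default_factory=list)
    insights: List[str] = field(default_factory=list)


def run_qa_loop(
    dataset: dict,
    describe: Callable[[dict], str],
    retrieve: Callable[[LoopState], List[str]],
    raise_questions: Callable[[LoopState], List[str]],
    generate_insight: Callable[[str, LoopState], str],
    iterations: int = 2,
) -> LoopState:
    """Run the four modules in a fixed number of question-answer rounds."""
    # Dataset description: computed once, before the loop starts.
    state = LoopState(description=describe(dataset))
    for _ in range(iterations):
        # Knowledge retrieval: add external knowledge to the shared context.
        state.knowledge.extend(retrieve(state))
        # Question raising: propose new questions given everything so far.
        new_questions = raise_questions(state)
        state.questions.extend(new_questions)
        # Insight generation: answer each new question.
        for question in new_questions:
            state.insights.append(generate_insight(question, state))
    return state


# Placeholder modules (assumptions): real ones would call an LLM or a retriever.
def describe(dataset: dict) -> str:
    return "columns: " + ", ".join(sorted(dataset))


def retrieve(state: LoopState) -> List[str]:
    return [f"note {len(state.knowledge) + 1} about {state.description}"]


def raise_questions(state: LoopState) -> List[str]:
    n = len(state.questions)
    return [f"Q{n + 1}", f"Q{n + 2}"]


def generate_insight(question: str, state: LoopState) -> str:
    return f"answer to {question} using {len(state.knowledge)} knowledge item(s)"


state = run_qa_loop(
    {"region": [], "sales": []},
    describe, retrieve, raise_questions, generate_insight,
    iterations=2,
)
for insight in state.insights:
    print(insight)
```

Passing the modules in as callables keeps the loop itself independent of any particular model or retrieval backend, which mirrors the modular framing in the caption. The fixed iteration count is a simplification chosen for the sketch.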