DataScout: Automatic Data Fact Retrieval for Statement Augmentation with an LLM-Based Agent

Chuer Chen, Yuqi Liu, Danqing Shi, Shixiong Cao, Nan Cao

TL;DR

DataScout addresses the time-consuming task of locating data facts to augment data-driven narratives. It introduces an LLM-based agent that builds a retrieval tree in collaboration with the user. The agent decomposes queries, searches data via text-to-SQL, and extracts stance-aligned facts using Chain-of-Thought prompting, and the resulting tree is visualized as a mind-map retrieval space to support human-AI collaboration. The work includes a formative study, expert interviews, and three case studies using World Development Indicators, which suggest that DataScout retrieves diverse, stance-aware data facts that strengthen a story's credibility and objectivity. It also discusses practical implications for integrating LLMs into data storytelling workflows and points to future improvements in accuracy, data sourcing, visualization diversity, and broader evaluation.

Abstract

A data story typically integrates data facts from multiple perspectives and stances to construct a comprehensive and objective narrative. However, retrieving these facts takes considerable time for data search and challenges the creator's analytical skills. In this work, we introduce DataScout, an interactive system that automatically performs reasoning and stance-based data fact retrieval to augment the user's statement. In particular, DataScout leverages an LLM-based agent to construct a retrieval tree, with users and the agent collaboratively controlling its expansion. The interface visualizes the retrieval tree as a mind map that lets users intuitively steer the retrieval direction and engage effectively in reasoning and analysis. We evaluate the proposed system through case studies and in-depth expert interviews. Our evaluation demonstrates that DataScout can effectively retrieve multifaceted data facts from different stances, helping users verify their statements and enhance the credibility of their stories.

Paper Structure

This paper contains 33 sections, 3 equations, 5 figures.

Figures (5)

  • Figure 1: An overview of the LLM-based agent. When the agent receives the user's instruction to expand a node, it sequentially executes the query decomposition, data search, and fact extraction processes. The generated sub-queries and retrieved data facts are then returned to the retrieval tree as new child nodes. Based on these child nodes, the agent plans the next node worth expanding to assist the user in decision-making.
  • Figure 2: The interface consists of three major views: ❶ the editor view; ❷ the retrieval space view; ❸ the retrieval details view.
  • Figure 3: A retrieval space (a) and three retrieved data facts (b, c, d) for the statement "Climate change is increasingly impacting global economies and public health in China over the past 10 years". Data facts (b, c) were retrieved to support the statement. Data fact b shows a growing trend of economic damages from particulate emissions, and data fact c displays the number of displacement cases in China caused by disasters in 2021. Data fact d, retrieved to oppose the statement, shows an upward trend in China's adjusted net national income per capita.
  • Figure 4: A retrieval space (a) and two retrieved data facts (b, c) for the statement "In the past decade, the level of education for Chinese women has increased, yet the gap in labor force participation of the gender in the market remains significant in China". Data facts (b, c) were retrieved to support the statement. Data fact b shows the association between the labor force participation rates of men and women in China, and data fact c shows an increased percentage of Chinese women enrolling in tertiary education from 2013 to 2022.
  • Figure 5: A retrieval space (a) and three retrieved data facts (b, c, d) for the statement "The aging population in Japan is intensifying, putting immense pressure on public finances and social security systems". Data facts (b, c, d) were retrieved to support the statement. Data fact b highlights the increasing proportion of Japan's elderly population (aged 65 and above) over the past decade, while data fact c displays the number of individuals aged 65 in 2023. Data fact d reveals a declining trend in Japan's birth rate.
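
The expansion step described in the Figure 1 caption (query decomposition, then data search, then fact extraction, with results attached as child nodes and a suggestion for the next node to expand) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Node` class, the `expand` and `suggest_next` functions, the three `toy_*` stand-ins for the LLM and text-to-SQL steps, and the choice to attach facts beneath their sub-query are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """One node of a hypothetical retrieval tree."""
    kind: str                       # "statement", "query", or "fact"
    text: str
    stance: Optional[str] = None    # "support" or "oppose", fact nodes only
    children: List["Node"] = field(default_factory=list)
    expanded: bool = False


def expand(
    node: Node,
    decompose: Callable[[str], List[str]],
    search: Callable[[str], list],
    extract: Callable[[str, list], List[Tuple[str, str]]],
) -> List[Node]:
    """Run decomposition -> data search -> fact extraction for one node."""
    for sub_query in decompose(node.text):
        query_node = Node("query", sub_query)
        rows = search(sub_query)  # stands in for a text-to-SQL lookup
        for stance, fact_text in extract(sub_query, rows):
            query_node.children.append(Node("fact", fact_text, stance))
        node.children.append(query_node)
    node.expanded = True
    return node.children


def suggest_next(root: Node) -> Optional[Node]:
    """Placeholder planner: first unexpanded query node, breadth-first."""
    queue = [root]
    while queue:
        current = queue.pop(0)
        if current.kind == "query" and not current.expanded:
            return current
        queue.extend(current.children)
    return None


# Toy stand-ins (assumptions); a real system would call an LLM and a database.
def toy_decompose(text: str) -> List[str]:
    return [f"{text} (economy)", f"{text} (health)"]


def toy_search(query: str) -> list:
    return [("2020", 1.0), ("2021", 2.0)]


def toy_extract(query: str, rows: list) -> List[Tuple[str, str]]:
    return [("support", f"{len(rows)} rows found for: {query}")]


root = Node("statement", "Climate change affects China")
children = expand(root, toy_decompose, toy_search, toy_extract)
next_node = suggest_next(root)
```

In this sketch the user remains in control: `suggest_next` only proposes a node, and nothing is expanded until `expand` is called again on whichever node the user chooses.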