LINX: A Language Driven Generative System for Goal-Oriented Automated Data Exploration

Tavor Lipman, Tova Milo, Amit Somech, Tomer Wolfson, Oz Zafar

TL;DR

LINX addresses a limitation of prior Automated Data Exploration (ADE) systems by generating goal-oriented exploration sessions from an analytical goal stated in natural language. It combines an LLM-driven specification derivation (NL2PD2LDX) with a modular ADE engine built on Constrained Deep Reinforcement Learning (CDRL) that enforces exploration specifications via an LDX verification engine and a specification-aware network. Key contributions include the LDX language and its verifier, a two-stage NL2PD2LDX pipeline to derive executable exploration plans, a novel CDRL framework with end-of-session and immediate compliance rewards, and a large benchmark (182 goal–LDX pairs), plus a user study showing that LINX outperforms ChatGPT, ATENA, and Google Sheets in relevance and practical insight. The approach yields personalized, goal-relevant notebooks that better support analysts in goal-driven data exploration and offers a scalable path toward broader ADE capabilities, including future integration with auto-visualizations and interactive analysis.

Abstract

Data exploration is a challenging process in which users examine a dataset by iteratively employing a series of queries. While in some cases the user explores a new dataset to become familiar with it, more often, the exploration process is conducted with a specific analysis goal or question in mind. To assist users in exploring a new dataset, Automated Data Exploration (ADE) systems have been devised in previous work. These systems aim to auto-generate a full exploration session, containing a sequence of queries that showcase interesting elements of the data. However, existing ADE systems are often constrained by a predefined objective function, thus always generating the same session for a given dataset. Therefore, their effectiveness in goal-oriented exploration, in which users need to answer specific questions about the data, is extremely limited. To this end, this paper presents LINX, a generative system augmented with a natural language interface for goal-oriented ADE. Given an input dataset and an analytical goal described in natural language, LINX generates a personalized exploratory session that is relevant to the user's goal. LINX utilizes a Large Language Model (LLM) to interpret the input analysis goal, and then derives a set of specifications for the desired output exploration session. These specifications are then transferred to a novel, modular ADE engine based on Constrained Deep Reinforcement Learning (CDRL), which can adapt its output according to the specified instructions. To validate LINX's effectiveness, we introduce a new benchmark dataset for goal-oriented exploration and conduct an extensive user study. Our analysis underscores LINX's superior capability in producing exploratory notebooks that are significantly more relevant and beneficial than those generated by existing solutions, including ChatGPT, goal-agnostic ADE, and commercial systems.
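
The TL;DR mentions that the CDRL engine uses both end-of-session and immediate compliance rewards. The following is a toy sketch of how two such signals could be combined, not the paper's actual reward design. Everything in it is an assumption made for illustration: the `Requirement` type, the dictionary representation of a query step, the function names, and the bonus and penalty values. Real LDX specifications are richer than a list of per-step predicates.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Set

# Hypothetical stand-in: one exploration step is a dict describing a query
# operation, e.g. {"op": "filter", "column": "country"}.
Step = dict


@dataclass(frozen=True)
class Requirement:
    """Hypothetical, simplified specification item: a named predicate on one step."""
    name: str
    matches: Callable[[Step], bool]


def immediate_reward(step: Step, satisfied: Set[str],
                     reqs: Sequence[Requirement], bonus: float = 1.0) -> float:
    """Per-step signal: reward a step that meets a requirement not met before."""
    newly = [r.name for r in reqs if r.name not in satisfied and r.matches(step)]
    satisfied.update(newly)
    return bonus * len(newly)


def end_of_session_reward(satisfied: Set[str], reqs: Sequence[Requirement],
                          bonus: float = 5.0, penalty: float = -5.0) -> float:
    """Terminal signal: bonus only if every requirement was met in the session."""
    return bonus if all(r.name in satisfied for r in reqs) else penalty


def session_reward(session: Sequence[Step], reqs: Sequence[Requirement]) -> float:
    """Sum the immediate rewards over the session, then add the terminal reward."""
    satisfied: Set[str] = set()
    total = 0.0
    for step in session:
        total += immediate_reward(step, satisfied, reqs)
    return total + end_of_session_reward(satisfied, reqs)


# Example requirements (column names are made up for the illustration).
reqs = [
    Requirement("filter_country",
                lambda s: s.get("op") == "filter" and s.get("column") == "country"),
    Requirement("some_group_by", lambda s: s.get("op") == "group"),
]
compliant = [{"op": "filter", "column": "country"}, {"op": "group", "column": "year"}]
partial = [{"op": "filter", "column": "country"}]

print(session_reward(compliant, reqs))  # 1.0 + 1.0 + 5.0 = 7.0
print(session_reward(partial, reqs))    # 1.0 - 5.0 = -4.0
```

In this sketch the immediate term gives the agent feedback as soon as a step helps satisfy the specification, while the terminal term only pays off for a fully compliant session; a real system would add these to a reward for the general interestingness of the session.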

Paper Structure (28 sections, 2 equations, 9 figures, 4 tables, 2 algorithms)

This paper contains 28 sections, 2 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: An Example LINX Workflow for Auto-Generating Goal-Oriented Exploration Sessions
  • Figure 2: Specification-Aware Network Architecture
  • Figure 3: Examples of the chained prompts: (1) NL to non-executable Pandas code, and (2) Pandas code to LDX
  • Figure 4: Benchmark Dataset Generation
  • Figure 5: User Study -- Relevance Rating of Exploration Notebooks to the Given Goal
  • ...and 4 more figures

Theorems & Definitions (1)

  • Definition 1: LDX Assignment