Data-Semantics-Aware Recommendation of Diverse Pivot Tables

Whanhee Cho, Anna Fariha

TL;DR

The paper addresses automatic, diverse pivot-table recommendations for high-dimensional datasets by proposing SAGE, a data-semantics-aware system. It defines a joint utility model combining insightfulness (semantic and statistical signals) and interpretability, and introduces a semantics-based diversity measure to ensure non-redundant pivot tables. A greedy algorithm with aggressive pruning and an LLM-proxy enables scalable generation of a diverse, high-utility pivot-table set under a budget, demonstrated across four real datasets and via a user study. Results show SAGE outperforms baselines in utility and diversity while remaining scalable, adaptable to user feedback, and capable of highlighting nontrivial insights beyond standard spreadsheet recommendations. The work advances pivot-table summarization by integrating data semantics, interpretability, and diversity into a unified, efficient framework with practical implications for data exploration in spreadsheets and beyond.

Abstract

Data summarization is essential to discover insights from large datasets. In spreadsheets, pivot tables offer a convenient way to summarize tabular data by computing aggregates over some attributes, grouped by others. However, identifying attribute combinations that will result in useful pivot tables remains a challenge, especially for high-dimensional datasets. We formalize the problem of automatically recommending insightful and interpretable pivot tables, eliminating the tedious manual process. A crucial aspect of recommending a set of pivot tables is to diversify them. Traditional works inadequately address the table-diversification problem, which leads us to consider the problem of pivot table diversification. We present SAGE, a data-semantics-aware system for recommending k-budgeted diverse pivot tables, overcoming the shortcomings of prior work for top-k recommendations that cause redundancy. SAGE ensures that each pivot table is insightful, interpretable, and adaptive to the user's actions and preferences, while also guaranteeing that the pivot tables in the set differ from one another, offering a diverse recommendation. We make two key technical contributions: (1) a data-semantics-aware model to measure the utility of a single pivot table and the diversity of a set of pivot tables, and (2) a scalable greedy algorithm that can efficiently select a set of diverse pivot tables of high utility, by leveraging data semantics to significantly reduce the combinatorial search space. Our extensive experiments on three real-world datasets show that SAGE outperforms alternative approaches, and efficiently scales to accommodate high-dimensional datasets. Additionally, we present several case studies to highlight SAGE's qualitative effectiveness over commercial software and Large Language Models (LLMs).
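To make the idea of budgeted, diversity-aware selection concrete, the following is a minimal sketch and not SAGE's actual algorithm. It assumes a toy dataset, a placeholder utility (the spread of the aggregated values), and a placeholder diversity measure (Jaccard distance between grouping-attribute sets); the paper's utility and diversity models are semantics-aware and are not reproduced here. All names (`ROWS`, `pivot`, `utility`, `distance`, `greedy_diverse`, `trade_off`) are hypothetical.

```python
from itertools import combinations

# Toy rows (hypothetical data, not taken from the paper).
ROWS = [
    {"gender": "F", "dept": "Eng", "level": "Senior", "salary": 120},
    {"gender": "M", "dept": "Eng", "level": "Senior", "salary": 150},
    {"gender": "F", "dept": "Ops", "level": "Junior", "salary": 70},
    {"gender": "M", "dept": "Ops", "level": "Junior", "salary": 80},
]

def pivot(rows, group_by, value):
    """Average `value` for each distinct combination of `group_by` attributes."""
    sums, counts = {}, {}
    for row in rows:
        key = tuple(row[attr] for attr in group_by)
        sums[key] = sums.get(key, 0) + row[value]
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

def utility(table):
    """Placeholder utility (assumption): spread between largest and smallest aggregate."""
    values = list(table.values())
    return max(values) - min(values)

def distance(a, b):
    """Placeholder diversity (assumption): Jaccard distance of grouping-attribute sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def greedy_diverse(scores, k, trade_off=0.5):
    """Greedily pick up to k candidates, balancing utility against distance
    to the candidates already selected (a generic MMR-style heuristic)."""
    top = max(scores.values()) or 1  # normalize utilities into [0, 1]
    remaining = {c: s / top for c, s in scores.items()}
    selected = []
    while remaining and len(selected) < k:
        def gain(candidate):
            # Distance to the closest already-selected table; 1.0 if none yet.
            div = min((distance(candidate, s) for s in selected), default=1.0)
            return (1 - trade_off) * remaining[candidate] + trade_off * div
        best = max(remaining, key=gain)
        selected.append(best)
        del remaining[best]
    return selected

# Enumerate candidate groupings of one or two attributes and score each.
attributes = ["gender", "dept", "level"]
candidates = [c for n in (1, 2) for c in combinations(attributes, n)]
scores = {c: utility(pivot(ROWS, c, "salary")) for c in candidates}
picked = greedy_diverse(scores, k=2)
```

On this toy input, a pure top-k ranking by utility would return two overlapping groupings that both include `gender`, whereas the diversity term steers the second pick toward a grouping that shares no attribute with the first.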

Paper Structure

This paper contains 55 sections, 17 equations, 20 figures, 5 tables, 2 algorithms.

Figures (20)

  • Figure 1: A sample table from an employee compensation dataset.
  • Figure 3: Google Sheets recommendations are redundant and convoluted; Microsoft Excel includes meaningless recommendations such as SUM(Age); while ChatGPT recommendations look reasonable, they are data-content-unaware as the pivot table values are hallucinated.
  • Figure 4: SAGE satisfies all desirable properties. While PowerBI, Tableau, DAISY, and AutoSuggest do not directly/always recommend pivot tables, we include them since they recommend summaries. Code for DAISY and AutoSuggest is unavailable, thus we rely on the papers, and mark some things as unknown. LLMs are not designed to directly recommend pivot tables but can be prompted to do so.
  • Figure 5: Four pivot tables over the dataset of Figure 1. While (a) and (b) indicate gender-based salary gap, (c) and (d) add additional context.
  • Figure 6: Table of notations. We use bold letters to denote sets.
  • ...and 15 more figures

Theorems & Definitions (17)

  • Example 1.1
  • Example 1.2
  • Example 2.1
  • Definition 1: Pivot Table
  • Example 2.2
  • Example 2.3
  • Example 3.1
  • Example 3.2
  • Example 3.3
  • Example 3.4
  • ...and 7 more