Top-k Representative Search for Comparative Tree Summarization

Yuqi Chen, Xin Huang, Bilian Chen

TL;DR

This work addresses the problem of summarizing two weighted trees with identical topology to reveal their similarities and differences by selecting a compact set of k representative nodes split into k1 for similarity and k2 for difference, balanced by a scaling factor $\gamma$. It introduces a distribution-based, Hellinger-distance-inspired measure and self-feature components to guide a greedy SVDT algorithm with a (1−1/e) approximation guarantee, plus a visualization approach for compact presentation. Key contributions include the kVDT problem formulation, a distribution normalization technique for non-leaf nodes, a submodular objective with provable guarantees, and extensive experiments showing superior performance over single-tree baselines on diversity, query closeness, and structural preservation. The method scales to large hierarchies and supports both identical-structure and cross-structure trees, enabling informative comparative visual summaries for hierarchical data.

Abstract

Data summarization uses a small-scale summary to represent a massive dataset as a whole, which is useful for visualization and information snippet generation. However, most existing studies of hierarchical summarization work on only a single tree, selecting $k$ representative nodes, and neglect the important problem of comparative summarization on two trees. In this paper, given two trees with the same topology but different node weights, we aim to find $k$ representative nodes, where $k_1$ nodes summarize the common relationship between the trees and $k_2$ nodes highlight significantly different sub-trees, subject to $k_1+k_2=k$. To optimize the summarization results, we introduce a scaling coefficient that balances the summary view between two sub-trees in terms of similarity and difference. Additionally, we propose a novel definition based on the Hellinger distance to quantify the difference in node distribution between the sub-trees. We present a greedy algorithm, SVDT, that efficiently finds high-quality results with an approximation guarantee. Furthermore, we explore an extension of our comparative summarization to handle two trees with different structures. Extensive experiments demonstrate the effectiveness and efficiency of SVDT against existing summarization competitors.
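
For readers unfamiliar with the Hellinger distance mentioned above, the following is a minimal sketch of the standard textbook form for two discrete distributions. It is not the paper's own definition, which is only described here as based on that distance. The function name `hellinger` and the example weight vectors are hypothetical and chosen for illustration.

```python
import math


def hellinger(p, q):
    """Standard Hellinger distance between two non-negative weight vectors.

    Both vectors are normalized to sum to 1 first, so raw node weights
    can be passed directly. The result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each vector needs a positive total weight")
    squared = sum(
        (math.sqrt(a / total_p) - math.sqrt(b / total_q)) ** 2
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / 2.0)


# Hypothetical child-weight vectors of the same node in two trees.
same = hellinger([3, 1], [6, 2])      # identical proportions
disjoint = hellinger([5, 0], [0, 5])  # no shared mass
```

Identical proportions give a distance of 0 and distributions with no shared mass give 1, which is what makes the measure convenient for ranking how strongly two sub-trees differ.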

Paper Structure (17 sections, 1 theorem, 10 equations, 9 figures, 2 algorithms)

This paper contains 17 sections, 1 theorem, 10 equations, 9 figures, 2 algorithms.

Key Result

Theorem 1

$sum$ is submodular, i.e., for all $S_a, S_b \subseteq \mathcal{V}$ with $S_a \subseteq S_b$, we have $sum(S_a \cup \{x\}) - sum(S_a) \geq sum(S_b \cup \{x\}) - sum(S_b)$.
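
To make the diminishing-returns inequality concrete, here is a small sketch of greedy selection under a cardinality budget. It is not SVDT: the paper's objective $sum$ is not reproduced on this page, so a toy coverage function stands in as an assumed monotone submodular score. The names `coverage`, `greedy_select`, `is_submodular`, and the `covers` data are all hypothetical.

```python
from itertools import combinations


def coverage(selected, covers):
    """Toy monotone submodular score: how many items the selected nodes cover."""
    covered = set()
    for node in selected:
        covered |= covers[node]
    return len(covered)


def greedy_select(candidates, k, score):
    """Pick up to k candidates, each time taking the largest marginal gain."""
    chosen = []
    remaining = sorted(candidates)  # sorted for deterministic tie-breaking
    for _ in range(min(k, len(remaining))):
        base = score(chosen)
        best = max(remaining, key=lambda x: score(chosen + [x]) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


def is_submodular(candidates, score):
    """Brute-force check of the inequality over all nested pairs S_a <= S_b."""
    items = sorted(candidates)
    subsets = [set(c) for r in range(len(items) + 1)
               for c in combinations(items, r)]
    for s_b in subsets:
        for s_a in subsets:
            if not s_a <= s_b:
                continue
            for x in items:
                gain_a = score(list(s_a | {x})) - score(list(s_a))
                gain_b = score(list(s_b | {x})) - score(list(s_b))
                if gain_a < gain_b:
                    return False
    return True


# Hypothetical data: each candidate node covers a set of item ids.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}


def score(selected):
    return coverage(selected, covers)


picked = greedy_select(covers, 2, score)
```

For a monotone submodular score, this greedy loop is the classical procedure that achieves the (1−1/e) approximation ratio under a cardinality constraint. The brute-force checker is exponential in the number of candidates and is only meant for sanity-checking small examples.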

Figures (9)

  • Figure 1: A running example of our problem.
  • Figure 2: An example of passing up the similarity and difference distribution. Here $\beta=1$. Finally, the root $r$ has $Sim_D(r)=[50, 45]$ and $Dif_D(r)=[100, 0]$.
  • Figure 3: Summary visualization based on SVDT answers.
  • Figure 4: Handle two trees' summarization in different structures.
  • Figure 5: Diversity $Div(S)$ evaluation on all datasets.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Definition 1: Differential Weight
  • Definition 2: Scaling Coefficient
  • Theorem 1
  • proof
  • Example 1