Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers

Parthiv Chatterjee, Shivam Sonawane, Amey Hengle, Aditya Tanna, Sourish Dasgupta, Tanmoy Chakraborty

TL;DR

This work tackles the scarcity of training data for dynamic, personalized summarization by introducing PerAugy, a cross-trajectory data augmentation method that operates on a User-Interaction Graph (UIG). Using Double Shuffling (DS) and Stochastic Markovian Perturbation (SMP), PerAugy generates diverse, coherent synthetic user trajectories that boost user-encoder accuracy across multiple baselines and improve downstream personalization in frameworks such as PENS and GTP. The paper also introduces three dataset-diversity metrics, TP, RTC, and DegreeD, and finds a strong correlation between DegreeD and user-encoder performance. Cross-domain experiments show that PerAugy generalizes to low-resource domains (e.g., OpenAI Reddit), suggesting practical benefits for real-world personalized summarization systems. Overall, PerAugy offers a principled, scalable way to diversify training data for dynamic user preferences, improving both personalization quality and robustness.

Abstract

Document summarization enables efficient extraction of user-relevant content but is inherently shaped by individual subjectivity, making it challenging to identify subjectively salient information in multifaceted documents. This complexity underscores the necessity for personalized summarization. However, training models for personalized summarization has so far been challenging, particularly because diverse training data containing both user preference history (i.e., click-skip trajectory) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a valuable resource but includes only preference history without target summaries, preventing end-to-end supervised learning, and its limited topic-transition diversity further restricts generalization. To address this, we propose $\mathrm{PerAugy}$, a novel data augmentation technique based on cross-trajectory shuffling and summary-content perturbation that significantly boosts the accuracy of four state-of-the-art (SOTA) baseline user-encoders commonly used in personalized summarization frameworks (best result: $0.132\uparrow$ w.r.t. AUC). We select two such SOTA summarizer frameworks as baselines and observe that, when augmented with their corresponding improved user-encoders, they consistently show an increase in personalization (avg. boost: $61.2\%\uparrow$ w.r.t. the PSE-SU4 metric). As a post-hoc analysis of the diversity that $\mathrm{PerAugy}$ induces in the augmented dataset, we introduce three dataset diversity metrics: $\mathrm{TP}$, $\mathrm{RTC}$, and $\mathrm{DegreeD}$. We find that $\mathrm{TP}$ and $\mathrm{DegreeD}$ strongly correlate with user-encoder performance on the PerAugy-generated dataset across all accuracy metrics, indicating that increased dataset diversity is a key factor driving performance gains.

Paper Structure

This paper contains 127 sections, 8 theorems, 36 equations, 10 figures, 11 tables, 4 algorithms.

Key Result

Lemma 1

For all $i,j$,

Figures (10)

  • Figure 1: UIG construction pipeline for PENS-styled datasets. Step 1: Documents from train/valid data are sequenced as d-nodes; Step 2: Reference personalized headlines for an intersecting d-node from test data are interleaved as s-nodes based on time-step; Step 3: If no intersecting d-node is found, the s-node and its corresponding d-node from test data are simply appended at their respective time-step.
  • Figure 2: UIG construction pipeline for OpenAI-styled datasets. Step 1: Extract NewsID, UserID, confidence, and summary; Step 2: Select top-rating $\langle U_j, N_{ij}\rangle$ click pairs from filtered confidences; Step 3: Shuffle clicks, skips, and summaries to form trajectories.
  • Figure 3: PerAugy, our proposed framework: (a) Pipeline overview depicting the two-step augmentation; (b) Double Shuffling (DS) to ensure cross-trajectory augmentation and induce diffusion; (c) Stochastic Markovian Perturbation (SMP) to smooth the s-nodes and modulate the random diffusion introduced in the DS stage.
  • Figure 4: Two-stage creation of the original PENS test data (pens-acl-2021). Stage 1: Participants selected 50+ preferred headlines from 1,000 shown titles; Stage 2: They rewrote headlines for 200 unseen articles using only news bodies, without seeing the original titles.
  • Figure 5: Effect of PerAugy hyper-parameters on user-encoder accuracy. All encoder models are trained from scratch; results are summarized in Table \ref{tab:ablation-comparative-impact}. Observation 1: The best hyper-parameter values perform consistently across models; Observation 2: For DS, $g_l=40$ and $\tau_{h_{\text{train}}}=l-3$ favor longer profile/history retention; Observation 3: For SMP, $k=10$, $p_{\text{SMP}}=0.8$, $\lambda=0.3$ control abrupt diffusion best, and non-Markovian smoothing is preferred.
  • ...and 5 more figures
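
The three steps in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the `build_trajectory` function, and the input shapes (a time-ordered list of document IDs and a time-ordered mapping from document ID to reference headline) are hypothetical names and assumptions introduced here for illustration. In particular, Step 3 is simplified so that unmatched pairs are appended at the end of the trajectory, since the caption does not say how time-steps are represented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    """One trajectory node (hypothetical type, for illustration only)."""
    kind: str        # "d" = document node, "s" = summary (headline) node
    doc_id: str
    text: str = ""   # headline text for s-nodes, empty for d-nodes


def build_trajectory(history: List[str], headlines: Dict[str, str]) -> List[Node]:
    """Interleave s-nodes into a time-ordered d-node sequence.

    history   -- document IDs from train/valid data, in time order (assumed)
    headlines -- doc ID -> reference headline from test data, in time order (assumed)
    """
    trajectory: List[Node] = []
    placed = set()  # doc IDs whose s-node has already been interleaved

    # Step 1: sequence the documents as d-nodes.
    for doc_id in history:
        trajectory.append(Node("d", doc_id))
        # Step 2: interleave the headline right after its intersecting d-node
        # (only once, even if the document recurs in the history).
        if doc_id in headlines and doc_id not in placed:
            trajectory.append(Node("s", doc_id, headlines[doc_id]))
            placed.add(doc_id)

    # Step 3 (simplified assumption): with no intersecting d-node, append the
    # test-data d-node together with its s-node at the end of the trajectory.
    for doc_id, headline in headlines.items():
        if doc_id not in placed:
            trajectory.append(Node("d", doc_id))
            trajectory.append(Node("s", doc_id, headline))
            placed.add(doc_id)

    return trajectory
```

For example, calling `build_trajectory(["n1", "n2", "n3"], {"n2": "h2", "n9": "h9"})` places the s-node for `n2` directly after its d-node and appends the `n9` d-node/s-node pair after `n3`.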

Theorems & Definitions (11)

  • Definition 1
  • Definition 2: Degree of Preference Shift (DePS)
  • Definition 3
  • Lemma 1: Local distortions
  • Proposition 1: User-level Bounding Inequalities
  • Corollary 1: Dataset-level Bounding Inequalities
  • Corollary 2: Pure scaling
  • Proposition 2: Substitution Stability of DegreeD
  • Proposition 3: Correlation transfer
  • Lemma 2: Lower bound on $r_{FG}$
  • ...and 1 more