Sequential Data Augmentation for Generative Recommendation

Geon Lee, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Kijung Shin, Neil Shah, Liam Collins

TL;DR

This paper proposes GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies.

Abstract

Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the process of constructing training data from user interaction histories. By shaping the training distribution, data augmentation directly and often substantially affects model generalization and performance. Nevertheless, in much of the existing work, this process is simplified, applied inconsistently, or treated as a minor design choice, without a systematic and principled understanding of its effects. Motivated by our empirical finding that different augmentation strategies can yield large performance disparities, we conduct an in-depth analysis of how they reshape training distributions and influence alignment with future targets and generalization to unseen inputs. To systematize this design space, we propose GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. This formulation unifies widely used strategies as special cases and enables flexible control of the resulting training distribution. Our extensive experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies, providing practical guidance for principled training data construction in generative recommendation. Our code is available at https://github.com/snap-research/GenPAS.
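
As an illustration of the three-step sampling view described in the abstract, the sketch below draws one input-target pair by sampling a sequence, then a target position, then an input window ending just before the target. The function name `sample_pair`, the three bias exponents, and the power-law weighting are hypothetical choices made for this example; they are not taken from the paper, whose exact parameterization is not given here.

```python
import random


def sample_pair(sequences, seq_bias=1.0, target_bias=1.0, input_bias=1.0, rng=random):
    """Draw one (input, target) pair via three biased sampling steps.

    All weighting schemes here are illustrative assumptions, not the
    paper's definitions.
    """
    # Step 1 (sequence sampling): pick a user sequence; the weight grows
    # with sequence length raised to seq_bias. Sequences shorter than 2
    # cannot form an input-target pair and are skipped.
    eligible = [s for s in sequences if len(s) >= 2]
    seq = rng.choices(eligible, weights=[len(s) ** seq_bias for s in eligible], k=1)[0]

    # Step 2 (target sampling): pick a 0-based target position k >= 1;
    # a larger target_bias favors later positions.
    positions = list(range(1, len(seq)))
    k = rng.choices(positions, weights=[(p + 1) ** target_bias for p in positions], k=1)[0]

    # Step 3 (input sampling): pick a start index j < k so the input is the
    # contiguous window seq[j:k]; a larger input_bias favors longer inputs.
    starts = list(range(k))
    j = rng.choices(starts, weights=[(k - s) ** input_bias for s in starts], k=1)[0]

    return seq[j:k], seq[k]
```

Setting an exponent to 0 makes the corresponding step uniform, which is one simple way to see how each step's bias changes the resulting training distribution.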

Paper Structure

This paper contains 23 sections, 2 theorems, 15 equations, 5 figures, 9 tables.

Key Result

Theorem 1

Suppose the independence assumptions on users and on items hold. Denote $\delta_k := \text{TV}(p_k, p_{n+1})$. Then, with probability at least $0.99$,

Figures (5)

  • Figure 1: Different strategies yield distinct target distributions. LT skews toward frequent items, while MT and SW produce a more balanced distribution.
  • Figure 2: Different strategies produce distinct input-target distributions. LT yields few inputs per target, MT yields more by using more target positions, and SW yields the most by enumerating all subsequences.
  • Figure 3: The two parameters, $\alpha$ and $\beta$, jointly shape the training distribution and have a substantial impact on model performance. Their impact patterns differ across datasets.
  • Figure 4: GenPAS enhances data efficiency. SASRec with GenPAS outperforms the full-data baseline without augmentation, even when trained on only 5%, 10%, or 20% of the original data.
  • Figure 5: GenPAS enhances long-tail performance. SASRec equipped with GenPAS consistently outperforms the non-augmented model across all item groups, from the least popular (G1) to the most popular (G3).
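
The contrast among the three strategies in Figures 1 and 2 can be made concrete with a small sketch. It assumes that LT, MT, and SW stand for last-target, multi-target, and sliding-window construction, which the captions do not spell out; the function names are hypothetical.

```python
def last_target(seq):
    """LT (assumed): one pair per sequence, the full prefix predicts the last item."""
    return [(seq[:-1], seq[-1])] if len(seq) >= 2 else []


def multi_target(seq):
    """MT (assumed): every position after the first is a target for its full prefix."""
    return [(seq[:k], seq[k]) for k in range(1, len(seq))]


def sliding_window(seq):
    """SW (assumed): every contiguous subsequence ending just before a target is an input."""
    return [(seq[j:k], seq[k]) for k in range(1, len(seq)) for j in range(k)]


history = ["a", "b", "c", "d"]
lt_pairs = last_target(history)      # 1 pair
mt_pairs = multi_target(history)     # 3 pairs
sw_pairs = sliding_window(history)   # 6 pairs
```

Under these assumed definitions, a sequence of length n gives 1 pair with LT, n - 1 pairs with MT, and n(n - 1)/2 pairs with SW, which matches the ordering of inputs per target described in the Figure 2 caption.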

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2