Generative Regression Based Watch Time Prediction for Short-Video Recommendation

Hongxu Ma, Kai Tian, Tao Zhang, Xuefeng Zhang, Han Zhou, Chunjie Chen, Han Li, Jihong Guan, Shuigeng Zhou

TL;DR

This work tackles watch time prediction (WTP) for short-video recommendation by reframing the task as Generative Regression (GR), which treats the target $y_i$ as a sequence of time-slot tokens generated autoregressively. GR combines a data-driven vocabulary built via dynamic quantile adjustment, a label-encoding scheme that enables near-lossless reconstruction of the original value, and a Transformer-based encoder-decoder architecture that captures inter-token dependencies. To mitigate the exposure bias introduced by teacher forcing, it trains with curriculum learning with embedding mixup (CLEM). Empirically, GR outperforms state-of-the-art baselines on offline datasets and yields positive online A/B results, with additional gains on a cross-task LTV prediction benchmark, suggesting that GR can serve as a general regression paradigm for recommender systems. The approach provides a foundation for applying generative sequence modeling to continuous regression targets in recommendation, including targets with long-tailed distributions.

Abstract

Watch time prediction (WTP) has emerged as a pivotal task in short-video recommendation systems, designed to quantify user engagement through continuous interaction modeling. Predicting users' watch times on videos often encounters fundamental challenges, including wide value ranges and imbalanced data distributions, which can lead to significant estimation bias when regression techniques are applied directly. Recent studies have attempted to address these issues by converting continuous watch time estimation into an ordinal regression task. While these methods are partially effective, they exhibit notable limitations: (1) the discretization process frequently relies on bucket partitioning, which inherently reduces prediction flexibility and accuracy, and (2) the interdependencies among different partition intervals remain underutilized, missing opportunities for effective error correction. Inspired by language modeling paradigms, we propose a novel Generative Regression (GR) framework that reformulates WTP as a sequence generation task. Our approach employs structural discretization to enable nearly lossless value reconstruction while maintaining prediction fidelity. Through carefully designed vocabulary construction and label encoding schemes, each watch time is bijectively mapped to a token sequence. To mitigate the training-inference discrepancy caused by teacher forcing, we introduce a curriculum learning with embedding mixup strategy that gradually transitions from guided to free-generation modes. We evaluate our method against state-of-the-art approaches on two public datasets and one industrial dataset. We also perform online A/B testing on the Kuaishou App to confirm its real-world effectiveness. The results show that GR significantly outperforms existing techniques.
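
To make the idea of mapping a watch time to a token sequence concrete, the following is a minimal sketch rather than the paper's actual encoding. It assumes a small hand-picked vocabulary of time-slot lengths in seconds and a greedy decomposition in which the decoded value is the sum of the chosen slots; the names `encode`, `decode`, `vocab`, and `max_len` are hypothetical, and the paper's dynamic-quantile vocabulary is not reproduced here.

```python
from typing import List, Sequence


def encode(watch_time: float, vocab: Sequence[float], max_len: int = 8) -> List[int]:
    """Greedily decompose a watch time into time-slot token ids.

    Assumption (not from the source): each token id indexes a slot length
    in `vocab`, and a sequence represents the sum of its slot lengths.
    """
    # Token ids ordered from the longest slot to the shortest.
    order = sorted(range(len(vocab)), key=lambda i: vocab[i], reverse=True)
    remaining = watch_time
    tokens: List[int] = []
    for _ in range(max_len):
        # Pick the longest slot that still fits into the remaining time.
        pick = next((i for i in order if vocab[i] <= remaining + 1e-9), None)
        if pick is None:
            break  # the residual is smaller than the shortest slot
        tokens.append(pick)
        remaining -= vocab[pick]
    return tokens


def decode(tokens: Sequence[int], vocab: Sequence[float]) -> float:
    """Reconstruct the watch time as the sum of the slot lengths."""
    return sum(vocab[i] for i in tokens)


vocab = [1.0, 5.0, 20.0, 60.0]   # hypothetical slot lengths in seconds
tokens = encode(87.4, vocab)     # slots 60, 20, 5, 1, 1
restored = decode(tokens, vocab)
print(tokens, restored)
```

In this sketch the reconstruction error stays below the shortest slot as long as `max_len` does not truncate the sequence, which is one simple way to read "nearly lossless"; how the paper bounds the error and terminates sequences may differ.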

Paper Structure (29 sections, 10 equations, 5 figures, 5 tables, 1 algorithm)

This paper contains 29 sections, 10 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Predictive paradigm comparison among ordinal regression methods CREAD (a) and TPM (b), and our generative regression (c). Red lines indicate the discretization structure.
  • Figure 2: The framework of the GR model, which adopts an encoder-decoder architecture. The encoder extracts user and video features, while the decoder predicts watch time in an autoregressive manner and employs the curriculum learning with embedding mixup (CLEM) strategy to alleviate training-and-inference inconsistency introduced by teacher forcing.
  • Figure 3: Watch time embedding with a weighted sum of token embeddings (left) and the probability distribution difference among tokens for each $\hat{s}_i^t$ (right). Best viewed in color.
  • Figure 4: Token distribution comparison among vocabulary construction methods: (a) Manual, (b) Binary, (c) Dynamic.
  • Figure 5: (a) Comparison of MAE on the KuaiRec dataset across videos with different watch time intervals. (b-d) The distribution comparison of predicted watch times among TPM, CREAD, and GR, compared to the Ground Truth (GT).
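
The captions of Figures 2 and 3 mention a curriculum learning with embedding mixup (CLEM) strategy and a watch time embedding formed as a weighted sum of token embeddings. The sketch below illustrates one plausible reading of those two ideas, under stated assumptions: the decoder input is a linear interpolation between the ground-truth token embedding and the probability-weighted sum of all token embeddings, and the teacher weight decays linearly over training. The function names, the interpolation rule, and the linear schedule are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def expected_embedding(probs: Sequence[float], table: Sequence[Sequence[float]]) -> List[float]:
    """Probability-weighted sum of token embeddings (one row of `table` per token)."""
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


def teacher_weight(step: int, total_steps: int) -> float:
    """Hypothetical curriculum: decay linearly from 1 (guided) to 0 (free generation)."""
    return max(0.0, 1.0 - step / total_steps)


def mixed_input(gt_token: int, probs: Sequence[float],
                table: Sequence[Sequence[float]], step: int, total_steps: int) -> List[float]:
    """Mix the ground-truth embedding with the model's expected embedding."""
    lam = teacher_weight(step, total_steps)
    pred = expected_embedding(probs, table)
    gt = table[gt_token]
    return [lam * g + (1.0 - lam) * p for g, p in zip(gt, pred)]


table = [[1.0, 0.0], [0.0, 1.0]]   # toy embedding table with two tokens
probs = [0.25, 0.75]               # toy predicted token distribution
print(mixed_input(0, probs, table, step=0, total_steps=100))    # pure teacher forcing
print(mixed_input(0, probs, table, step=50, total_steps=100))   # half-way mix
print(mixed_input(0, probs, table, step=100, total_steps=100))  # prediction only
```

Early in training the decoder sees only ground-truth embeddings, as under teacher forcing; by the end it sees only embeddings derived from its own predictions, which matches the conditions at inference time.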