SWaT: Statistical Modeling of Video Watch Time through User Behavior Analysis

Shentao Yang, Haichuan Yang, Linna Du, Adithya Ganesh, Bo Peng, Boying Liu, Serena Li, Ji Liu

TL;DR

This work tackles watch-time prediction on short-video platforms by introducing SWaT, a white-box, user-centric statistical framework that converts domain knowledge about watching behavior into bucketed probabilistic models over an infinite video horizon. SWaT instantiates two models, SWaT-Binom and SWaT-Geo, to capture wandering versus focused watching, with bucket-specific probabilities $\{p_i\}$ used to compute a closed-form estimate of the expected watch time $\mathbb{E}[T]$. Bucketizing the horizon stabilizes training and aligns the approach with existing classification-based losses, so it integrates readily into industrial recommender systems. Experiments on public datasets (CIKM16 Cup, KuaiRec), a large-scale offline industrial dataset, and an online A/B test show competitive performance against strong baselines and online gains in watch time and engagement, supporting the practical value of modeling user behavior in watch-time prediction.
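To make the role of the bucket probabilities $\{p_i\}$ concrete, here is a minimal sketch of how an expected watch time could be computed from them. It rests on assumptions not spelled out in this summary: bucket $i$ covers an integer number $l_i$ of time units; under the wandering (Binomial) reading each unit is watched independently with probability $p_i$; under the focused (Geometric) reading the user keeps watching unit by unit with probability $p_i$ and stops at the first failure. The function names, bucket lengths, and probabilities are hypothetical, and the paper's exact estimator may differ.

```python
from typing import Sequence


def _check(lengths: Sequence[int], probs: Sequence[float]) -> None:
    """Basic input validation shared by both sketches."""
    if len(lengths) != len(probs):
        raise ValueError("lengths and probs must have the same number of buckets")
    if any(l < 0 for l in lengths):
        raise ValueError("bucket lengths must be non-negative")
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")


def expected_watch_time_binom(lengths: Sequence[int], probs: Sequence[float]) -> float:
    """Assumed wandering mode: watch time in bucket i ~ Binomial(l_i, p_i),
    so the mean total watch time is sum_i l_i * p_i."""
    _check(lengths, probs)
    return float(sum(l * p for l, p in zip(lengths, probs)))


def expected_watch_time_geo(lengths: Sequence[int], probs: Sequence[float]) -> float:
    """Assumed focused mode: the user watches sequentially and continues past
    each time unit with the probability of the bucket containing that unit.
    Uses E[T] = sum_{t>=1} P(T >= t), truncated at the video length."""
    _check(lengths, probs)
    survival = 1.0  # probability of having watched every unit so far
    expected = 0.0
    for l, p in zip(lengths, probs):
        for _ in range(l):
            survival *= p
            expected += survival
    return expected


# Hypothetical buckets and probabilities, for illustration only.
lengths = [5, 7, 10]
probs = [0.8, 0.5, 0.2]
print("Binomial sketch:", expected_watch_time_binom(lengths, probs))
print("Geometric sketch:", expected_watch_time_geo(lengths, probs))
```

In both sketches the estimate can never exceed the total bucket length, which is one way a bucketed model can respect the video-length constraint.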

Abstract

The significance of estimating video watch time has been highlighted by the rising importance of (short) video recommendation, which has become a core product of mainstream social media platforms. Modeling video watch time, however, has been challenged by the complexity of user-video interaction, such as different user behavior modes in watching the recommended videos and varying watching probability over the video progress bar. Despite the importance and challenges, existing literature on modeling video watch time mostly focuses on relatively black-box mechanical enhancement of the classical regression/classification losses, without factoring in user behavior in a principled manner. In this paper, we take, for the first time, a user-centric perspective on modeling video watch time, from which we propose a white-box statistical framework that directly translates various user behavior assumptions in watching (short) videos into statistical watch time models. These behavior assumptions are informed by our domain knowledge of users' behavior modes in video watching. We further employ bucketization to cope with users' non-stationary watching probability over the video progress bar, which additionally helps to respect the constraint of video length and facilitates the practical compatibility between the continuous regression event of watch time and other binary classification events. We test our models extensively on two public datasets, a large-scale offline industrial dataset, and an online A/B test on a short video platform with hundreds of millions of daily active users. In all experiments, our models perform competitively against strong relevant baselines, demonstrating the efficacy of our user-centric perspective and proposed framework.

Paper Structure

This paper contains 34 sections, 20 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Overview of the SWaT framework. The model outputs the watching probability $\widehat{p}_i$ for each bucket $B_i$, based on which an estimate $\widehat{\mu}[T]$ of the watch time $T$ is calculated.
  • Figure 2: Illustration of the SWaT-Binom model in \ref{sec:bimon_model}, showing three buckets of lengths $5$, $7$, and $10$, respectively. The watch time is $3$ in bucket 1, $2$ in bucket 2, and $5$ in bucket 3.
  • Figure 3: Illustration of the SWaT-Geo model in \ref{sec:bucket_geo_model}, showing three buckets of lengths $5$, $7$, and $10$ and the probability calculations for total watch times $T=4,10,13$.
  • Figure 4: Performance of the SWaT-Binom model (\ref{sec:bimon_model}) and the SWaT-Geo model (\ref{sec:bucket_geo_model}) on the KuaiRec dataset, measured by XAUC and MAE, when varying the number of buckets across $\left\{10,20,50,100,200\right\}$. The horizontal axis denotes the number of buckets and the vertical axis denotes the metric value.
  • Figure 5: Performance of the SWaT-Binom model (\ref{sec:bimon_model}) and the SWaT-Geo model (\ref{sec:bucket_geo_model}) on the CIKM16 dataset, measured by XAUC and MAE, when varying the number of buckets across $\left\{10,20,50,100,200\right\}$. The horizontal axis denotes the number of buckets and the vertical axis denotes the metric value.
  • ...and 1 more figure
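The numeric settings in the captions of Figures 2 and 3 can be turned into a small worked example. The sketch below is an illustration under stated assumptions, not the paper's definition: for SWaT-Binom it assumes the watch time in bucket $i$ follows $\mathrm{Binomial}(l_i, p_i)$; for SWaT-Geo it assumes the user continues past each time unit with the probability of the bucket containing that unit and stops at the first failure, with no stop factor when the whole video is watched. The function names and the probabilities `probs` are hypothetical; only the bucket lengths and watch times come from the captions.

```python
from math import comb
from typing import Sequence


def binom_likelihood(lengths: Sequence[int], watched: Sequence[int],
                     probs: Sequence[float]) -> float:
    """Assumed Binomial model: product over buckets of
    C(l_i, t_i) * p_i**t_i * (1 - p_i)**(l_i - t_i)."""
    like = 1.0
    for l, t, p in zip(lengths, watched, probs):
        if not 0 <= t <= l:
            raise ValueError("per-bucket watch time must lie in [0, bucket length]")
        like *= comb(l, t) * p**t * (1.0 - p) ** (l - t)
    return like


def geo_likelihood(lengths: Sequence[int], total_watch: int,
                   probs: Sequence[float]) -> float:
    """Assumed Geometric model: probability that exactly `total_watch`
    units are watched in sequence before the user stops."""
    if not 0 <= total_watch <= sum(lengths):
        raise ValueError("total watch time must lie in [0, video length]")
    like = 1.0
    remaining = total_watch
    for l, p in zip(lengths, probs):
        if remaining >= l:
            like *= p**l  # the whole bucket was watched
            remaining -= l
        else:
            # the user stops inside this bucket after `remaining` units
            return like * p**remaining * (1.0 - p)
    return like  # every bucket watched to the end: no stop factor


lengths = [5, 7, 10]     # bucket lengths from the captions
probs = [0.8, 0.5, 0.2]  # hypothetical watching probabilities

# Figure 2 setting: watch times 3, 2, and 5 in buckets 1, 2, and 3.
print("Binomial likelihood:", binom_likelihood(lengths, [3, 2, 5], probs))

# Figure 3 setting: total watch times T = 4, 10, 13.
for T in (4, 10, 13):
    print(f"P(T = {T}) =", geo_likelihood(lengths, T, probs))
```

Under these assumptions, $T=4$ stops inside bucket 1, $T=10$ stops inside bucket 2 after bucket 1 is fully watched, and $T=13$ reaches one unit into bucket 3, giving $p_1^4(1-p_1)$, $p_1^5 p_2^5 (1-p_2)$, and $p_1^5 p_2^7 p_3 (1-p_3)$, respectively.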

Theorems & Definitions (5)

  • Remark 3.1
  • Remark 3.2
  • Remark 3.3
  • Remark 3.4
  • Remark 3.5