Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin

Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, Hangyu Wang, Qiwei Chen, Yi Cheng, Feng Zhang, Xiao Yang

TL;DR

The paper presents an end-to-end framework for long-history recommendation on short-video platforms, addressing the challenge of modeling 10k-length user histories under production constraints. It introduces Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with cross-attention from the target to the history and therefore scales linearly in sequence length, and couples it with Request Level Batching (RLB) to amortize user encoding across multiple targets. A train-sparsely/infer-densely extrapolation strategy enables dense inference on long histories without a proportional increase in training cost. Across offline and online evaluations on Douyin, the approach yields monotonic gains with history length and model capacity, and it is shown to be deployable in practice, with substantial bandwidth, throughput, and latency benefits. Together, these ideas provide a scalable, end-to-end path for long-sequence ranking in industrial-scale recommendation.

Abstract

Short-video recommenders such as Douyin must exploit extremely long user histories without breaking latency or cost budgets. We present an end-to-end system that scales long-sequence modeling to 10k-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy -- train on shorter windows, infer on much longer ones -- so the model generalizes to 10k histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end long-sequence recommendation to the 10k regime.
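
The first two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits learned projections, multi-head splitting, and feed-forward blocks, and the residual update used to stack layers is an assumption standing in for the paper's layer-wise fusion. The function names (`target_cross_attention`, `stacked_target_attention`, `score_request`) are hypothetical. The sketch shows that a single target query attending over a history of length L needs only L dot products per layer, and that one history can be reused for every target in a request.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def target_cross_attention(query: Vector, history: List[Vector]) -> Vector:
    """Single-query attention: one query attends over L history items.

    Cost is O(L * d): one dot product per history item, and no L x L
    score matrix is ever formed. Identity projections are assumed here.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, item)) / math.sqrt(d)
        for item in history
    ]
    weights = softmax(scores)
    return [sum(w * item[j] for w, item in zip(weights, history)) for j in range(d)]


def stacked_target_attention(
    target: Vector, history: List[Vector], num_layers: int = 2
) -> Vector:
    """Stack layers by feeding a residual-updated query to the next layer.

    The residual update is an assumed stand-in for layer-wise fusion.
    """
    query = list(target)
    for _ in range(num_layers):
        context = target_cross_attention(query, history)
        query = [q + c for q, c in zip(query, context)]
    return query


def score_request(
    targets: List[Vector], history: List[Vector], num_layers: int = 2
) -> List[Vector]:
    """Request-level view: the history is held once and reused for all targets."""
    return [stacked_target_attention(t, history, num_layers) for t in targets]
```

In this sketch, the total work for a request with m targets grows as m times L times d per layer. The history itself is stored and transferred once rather than m times, which is the kind of saving Request Level Batching aims for.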

Paper Structure

This paper contains 56 sections, 23 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Scaling with sequence length and model capacity. Finish AUC lift (percentage points) for the sequence module (STCA) as we increase user sequence length ($500\!\rightarrow\!10\mathrm{k}$ tokens) and sequence-module capacity (Simple: 6M, Medium: 23M, Complex: 133M parameters).
  • Figure 2: Overview of our long-history ranking stack. (A) Stacked Target Cross Attention: single-query cross attention from the target to the full history with layer-wise fusion, enabling linear scaling in history length and end-to-end optimization. (B) Request Level Batching: compute the user/history encoding once per request and reuse it across multiple targets to reduce bandwidth and compute. (C) Extrapolation Aware Training: sample shorter histories during training and serve longer histories at inference (Train Sparsely, Infer Densely).
  • Figure 3: Compute--quality scaling of STCA vs Transformer. X-axis: per-sample sequence-only forward FLOPs (log scale). Y-axis: NLL (lower is better). Markers show $L\!\in\!\{500,2\mathrm{k},8\mathrm{k},10\mathrm{k}\}$. Both use 4 layers with $d{=}256$, $h{=}8$, $r{=}4$.
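
As a reading aid for the FLOPs axis in Figure 3, the attention-score cost of the two designs can be compared with a back-of-the-envelope estimate. This is a sketch, not the paper's FLOPs accounting: it counts only the query-key products of one layer and ignores projections, feed-forward blocks, and constant factors. The symbols $C_{\text{self}}$ and $C_{\text{target}}$ are introduced here for illustration.

$$
C_{\text{self}}(L) \propto L^{2} d, \qquad
C_{\text{target}}(L) \propto L\, d, \qquad
\frac{C_{\text{self}}(L)}{C_{\text{target}}(L)} \propto L .
$$

Under these assumptions, the gap in this term grows linearly with history length, so it is on the order of $10^{4}$ at $L = 10\mathrm{k}$. Terms that are linear in $L$ for both designs, such as per-token projections, narrow the overall ratio.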