SARM: LLM-Augmented Semantic Anchor for End-to-End Live-Streaming Ranking

Ruochen Yang, Yueyang Liu, Zijie Zhuang, Changxin Lao, Yuhui Zhang, Jiangxia Cao, Jia Xu, Xiang Chen, Haoke Xiao, Xiangyu Wu, Xiaoyou Zhou, Xiao Lv, Shuang Yang, Tingwen Liu, Zhaojie Liu, Han Li, Kun Gai

TL;DR

SARM tackles live-streaming ranking under non-stationary content semantics and stringent latency constraints by introducing Semantic Anchors: trainable natural-language representations generated offline by a multimodal LLM. A lightweight Semantic Anchor Encoder with dual-token gated fusion encodes these anchors in real time and feeds them into an end-to-end ranking model, kept in sync through an asymmetric deployment that uses a memory bank for efficiency. The approach unifies semantic understanding with ranking optimization, and extensive offline experiments and long-running online A/B tests on the Kuaishou platform show consistent gains in engagement and business metrics while maintaining production efficiency. SARM serves over 400 million users daily and offers a scalable path to richer, more controllable multimodal recommendations in live-streaming environments.

Abstract

Large-scale live-streaming recommendation requires precise modeling of non-stationary content semantics under strict real-time serving constraints. In industrial deployment, two common approaches exhibit fundamental limitations: discrete semantic abstractions sacrifice descriptive precision through clustering, while dense multimodal embeddings are extracted independently and remain weakly aligned with ranking optimization, limiting fine-grained content-aware ranking. To address these limitations, we propose SARM, an end-to-end ranking architecture that integrates natural-language semantic anchors directly into ranking optimization, enabling fine-grained author representations conditioned on multimodal content. Each semantic anchor is represented as learnable text tokens jointly optimized with ranking features, allowing the model to adapt content descriptions to ranking objectives. A lightweight dual-token gated design captures domain-specific live-streaming semantics, while an asymmetric deployment strategy preserves low-latency online training and serving. Extensive offline evaluation and large-scale A/B tests show consistent improvements over production baselines. SARM is fully deployed and serves over 400 million users daily.
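
The abstract names a dual-token gated design but does not give its formula here. As a rough illustration only, the sketch below blends a general-vocabulary token embedding with a domain-token embedding through a scalar sigmoid gate. The function name gated_fusion, the scalar gate, and the concatenated gate input are all assumptions made for this example and may differ from the paper's actual module.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    general: List[float],
    domain: List[float],
    gate_w: List[float],
    gate_b: float = 0.0,
) -> List[float]:
    """Blend two embeddings of the same token position (hypothetical form).

    Assumed gate: g = sigmoid(w . [general; domain] + b)
    Assumed output: g * domain + (1 - g) * general
    """
    if len(general) != len(domain):
        raise ValueError("embeddings must have the same dimension")
    if len(gate_w) != 2 * len(general):
        raise ValueError("gate_w must cover the concatenated input")
    # Scalar gate computed from the concatenation of both embeddings.
    z = sum(w * x for w, x in zip(gate_w, general + domain)) + gate_b
    g = sigmoid(z)
    # Convex combination: g near 1 favors the domain embedding.
    return [g * d + (1.0 - g) * e for e, d in zip(general, domain)]
```

With all-zero gate weights and zero bias the gate is 0.5, so the output is the plain average of the two embeddings; a large positive bias pushes the output toward the domain embedding.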

Paper Structure

This paper contains 34 sections, 16 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Comparison of semantic representations: (Top) coarse discrete semantics, (Middle) dense embeddings misaligned with ranking objectives, and (Bottom) fine-grained ranking-aware Semantic Anchors.
  • Figure 2: Overall architecture of SARM. A fine-tuned MLLM generates semantic anchors that are encoded and optimized end-to-end with ranking objectives. A memory bank maintains representations for efficient online training and inference.
  • Figure 3: Detailed architecture of the gated fusion module. The Live-Streaming Tokenizer aggregates domain-specific terms, and the gated fusion module integrates the resulting embeddings into a standard LLM via external lookup, where an example tokenization process is illustrated for clarity.
  • Figure 4: Asymmetric deployment pipeline of SARM. A memory bank caches author representations to support efficient online training and inference.
  • Figure 5: Effect of the auxiliary loss in training.
  • ...and 3 more figures
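
Figures 2 and 4 describe a memory bank that caches author representations so that online training and inference avoid re-running the anchor encoder. The captions do not specify how the bank is keyed or refreshed, so the following is only a minimal sketch under assumptions: the class name MemoryBank, the per-author key, the refresh-on-changed-anchor-text policy, and the toy encoder are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


class MemoryBank:
    """Hypothetical cache of author representations keyed by author id."""

    def __init__(self, encode: Callable[[str], Vector]) -> None:
        self._encode = encode  # expensive anchor encoder (assumed interface)
        self._store: Dict[str, Tuple[str, Vector]] = {}
        self.encoder_calls = 0  # counts how often the encoder actually ran

    def refresh(self, author_id: str, anchor_text: str) -> None:
        """Asynchronous path: re-encode only when the anchor text changed."""
        cached = self._store.get(author_id)
        if cached is not None and cached[0] == anchor_text:
            return
        self.encoder_calls += 1
        self._store[author_id] = (anchor_text, self._encode(anchor_text))

    def lookup(self, author_id: str, default: Vector) -> Vector:
        """Serving path: dictionary read only, the encoder is never invoked."""
        cached = self._store.get(author_id)
        return cached[1] if cached is not None else default


def toy_encode(text: str) -> Vector:
    """Stand-in encoder for the example: [character count, space count]."""
    return [float(len(text)), float(text.count(" "))]
```

In this sketch, calling refresh twice with the same anchor text runs the encoder once, and lookup falls back to the supplied default for authors that have not been cached yet.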