HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming

Jiahui Chen, Bo Peng, Lianchen Jia, Zeyu Zhang, Tianchi Huang, Lifeng Sun

TL;DR

HiVid introduces an LLM-guided framework that generates chunk-level saliency weights for content-aware streaming, addressing the cost of manual annotation and the poor generalization of CV-based saliency. It comprises three modules: perception (LLM ratings of frame groups within a local window, plus periodic summaries), ranking (LLM-guided merge sort for global consistency), and prediction (multi-modal time-series forecasting with content-aware attention and an adaptive horizon). Together they support both VOD and live streaming, with the prediction module accommodating asynchronous LLM inference under low-latency constraints. Across public datasets, HiVid improves PLCC (Pearson linear correlation coefficient) by up to 11.5% for VOD and forecasting accuracy by up to 26% for live streaming over state-of-the-art baselines, and a real-world QoE study shows a 14.7% boost in MOS correlation. These results suggest that LLM-driven subjective reasoning can effectively substitute for costly human ratings in scalable, high-fidelity content-aware streaming.

Abstract

Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive, while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address three non-trivial challenges: (1) To extend LLMs' limited modality and circumvent token limits, we propose a perception module that assesses frames in a local context window, autoregressively building a coherent understanding of the video. (2) For VOD, where ratings are inconsistent across local windows, we propose a ranking module that performs global re-ranking with a novel LLM-guided merge-sort algorithm. (3) For live streaming, which requires low-latency online inference without future knowledge, we propose a prediction module that predicts future weights with a multi-modal time-series model comprising content-aware attention and an adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show that HiVid improves weight prediction accuracy by up to 11.5% for VOD and 26% for live streaming over SOTA baselines. A real-world user study validates that HiVid boosts streaming QoE correlation by 14.7%.
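
The abstract does not spell out the ranking algorithm, so the following is only a minimal sketch of one plausible reading of "LLM-guided merge sort": each local window is assumed to be already ranked best-first, and a pairwise comparator decides the order when windows are merged. The names `prefers`, `guided_merge_sort`, and `ranks_to_weights` are hypothetical, the comparator is a plain callable standing in for an LLM query, and the linear rank-to-weight mapping is an assumption rather than the paper's method.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical comparator: True when chunk `a` should rank above chunk `b`.
# In an LLM-guided setting this would wrap a model call; here it is any callable.
Prefers = Callable[[int, int], bool]


def merge(left: List[int], right: List[int], prefers: Prefers) -> List[int]:
    """Merge two best-first rankings using the pairwise comparator."""
    merged: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Take from the right run only when it is strictly preferred,
        # so ties keep the earlier window's order.
        if prefers(right[j], left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def guided_merge_sort(windows: Sequence[List[int]], prefers: Prefers) -> List[int]:
    """Merge locally ranked windows pairwise until one global ranking remains."""
    runs = [list(w) for w in windows if w]
    if not runs:
        return []
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                next_runs.append(merge(runs[k], runs[k + 1], prefers))
            else:
                next_runs.append(runs[k])  # odd run out is carried forward
        runs = next_runs
    return runs[0]


def ranks_to_weights(ranking: List[int]) -> Dict[int, float]:
    """Map a best-first ranking to weights in (0, 1]; linear spacing is an assumption."""
    n = len(ranking)
    return {chunk: (n - pos) / n for pos, chunk in enumerate(ranking)}


# Usage with a stand-in comparator backed by made-up scores (not real data).
demo_scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7, 4: 0.1, 5: 0.4}
demo_ranking = guided_merge_sort(
    [[1, 0], [3, 2], [5, 4]],
    lambda a, b: demo_scores[a] > demo_scores[b],
)
print(demo_ranking)  # [1, 3, 2, 5, 0, 4]
```

Because every comparison would cost one model query in practice, a merge-based scheme is attractive: it reuses the order already established inside each window and needs on the order of n log n comparisons rather than comparing every pair of chunks.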

Paper Structure

This paper contains 28 sections, 7 equations, 10 figures, 17 tables, 2 algorithms.

Figures (10)

  • Figure 1: Overview of content-aware streaming. The estimated chunk weights $w_i$ are incorporated into QoE and optimized by ABRs. Higher weights would render a better viewing experience.
  • Figure 2: Inaccurate saliency of previous work and significant overhead of human ratings.
  • Figure 3: Inconsistent rating distribution.
  • Figure 4: Overview of HiVid. The perception module generates a video summary with group ratings. The ranking module yields a ranking list via a merge-sort variant for VOD streaming. The prediction module predicts future weights via adaptive forecasting for live streaming. The final weights $w_i$ are incorporated into the QoE model.
  • Figure 5: We predict future weights upon LLM response. The future horizon is latency-adaptive.
  • ...and 5 more figures
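
The captions for Figures 1 and 4 say the chunk weights $w_i$ are incorporated into the QoE model that the ABR optimizes, but the model itself is not given on this page. As a rough illustration only, the sketch below assumes a common linear per-chunk QoE form (bitrate utility minus rebuffering and bitrate-switch penalties) and scales each chunk's term by its weight. The function name `weighted_qoe`, the penalty parameters, and their default values are hypothetical and not taken from the paper.

```python
from typing import Sequence


def weighted_qoe(
    weights: Sequence[float],
    bitrates_mbps: Sequence[float],
    rebuffer_s: Sequence[float],
    rebuffer_penalty: float = 1.0,  # illustrative default, not from the paper
    smooth_penalty: float = 1.0,    # illustrative default, not from the paper
) -> float:
    """Sum per-chunk QoE terms, each scaled by its content weight w_i."""
    if not (len(weights) == len(bitrates_mbps) == len(rebuffer_s)):
        raise ValueError("all inputs must have one entry per chunk")
    total = 0.0
    previous = None
    for w, q, r in zip(weights, bitrates_mbps, rebuffer_s):
        # The first chunk has no predecessor, so it incurs no switch penalty.
        switch = abs(q - previous) if previous is not None else 0.0
        total += w * (q - rebuffer_penalty * r - smooth_penalty * switch)
        previous = q
    return total


# Two bitrate plans with the same average bitrate; chunk 0 is the salient one.
weights = [0.9, 0.1]
plan_a = weighted_qoe(weights, [4.0, 1.0], [0.0, 0.0])  # quality on the salient chunk
plan_b = weighted_qoe(weights, [1.0, 4.0], [0.0, 0.0])  # quality on the other chunk
print(plan_a > plan_b)  # True
```

Under this assumed form, an ABR that maximizes the weighted sum is pushed to spend bandwidth on high-weight chunks, which is the behavior the overview figure describes.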