Tile-Weighted Rate-Distortion Optimized Packet Scheduling for 360$^\circ$ VR Video Streaming

Haopeng Wang, Haiwei Dong, Abdulmotaleb El Saddik

TL;DR

This work addresses bandwidth-constrained delivery of 360° VR video by introducing a tile-weighted rate-distortion framework that prioritizes the viewport. It combines a multimodal spatial-temporal attention transformer for viewpoint prediction with a tile-ranking scheme and a dynamic-programming-based packet scheduler to minimize a weighted distortion under a bitrate cap. Key contributions include the tile-weighted rate-distortion formulation, a DP solution with $O(n^2)$ complexity, and extensive evaluations on MMSys18 benchmarks and real-world traces showing improved viewport distortion and robustness. The approach can significantly enhance user experience in VR streaming by focusing resources on the viewport while controlling distortion, with feasible runtimes enabling online deployment and potential integration with scalable or adaptive coding strategies.

Abstract

A key challenge of 360$^\circ$ VR video streaming is ensuring high quality with limited network bandwidth. Currently, most studies focus on tile-based adaptive bitrate streaming to reduce bandwidth consumption, an approach in which resources in network nodes are not fully utilized. This article proposes a tile-weighted rate-distortion (TWRD) packet scheduling optimization system to reduce data volume and improve video quality. A multimodal spatial-temporal attention transformer is proposed to predict the viewpoint together with a probability, which is used to dynamically weight tiles and their corresponding packets. The packet scheduling problem of determining which packets should be dropped is formulated as an optimization problem and solved by dynamic programming. Experimental results demonstrate that the proposed method outperforms existing methods under various conditions.
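
The packet-dropping idea in the abstract can be illustrated with a small sketch. This is not the paper's TWRD formulation or its $O(n^2)$ algorithm, neither of which is reproduced here. It is a generic 0/1 knapsack dynamic program, shown only to make the "weighted distortion under a bitrate cap" idea concrete. The function name `schedule_packets`, the integer packet sizes, and the per-packet weights (standing in for viewpoint probabilities) are all assumptions made for illustration.

```python
from typing import List, Tuple


def schedule_packets(
    sizes: List[int],
    distortions: List[float],
    weights: List[float],
    budget: int,
) -> Tuple[List[int], float]:
    """Choose which packets to keep under a size budget.

    Hypothetical model (not taken from the paper): dropping packet i adds
    weights[i] * distortions[i] to the total weighted distortion. Keeping
    the subset with the largest total "gain" that fits in the budget is
    therefore equivalent to minimizing the weighted distortion of the
    dropped packets. Returns (kept_indices, weighted_distortion_of_dropped).
    """
    n = len(sizes)
    gains = [w * d for w, d in zip(weights, distortions)]

    # best[i][b] = largest gain achievable with the first i packets
    # using at most b units of the budget.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, gain = sizes[i - 1], gains[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: drop packet i-1
            if size <= b and best[i - 1][b - size] + gain > best[i][b]:
                best[i][b] = best[i - 1][b - size] + gain  # option 2: keep it

    # Walk the table backwards to recover which packets were kept.
    kept: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            kept.append(i - 1)
            b -= sizes[i - 1]
    kept.reverse()

    dropped_cost = sum(gains) - best[n][budget]
    return kept, dropped_cost


if __name__ == "__main__":
    # Four packets; the weights play the role of tile importance.
    kept, cost = schedule_packets(
        sizes=[4, 3, 2, 3],
        distortions=[10.0, 6.0, 4.0, 5.0],
        weights=[0.9, 0.1, 0.8, 0.2],
        budget=6,
    )
    print("kept packets:", kept)
    print("weighted distortion of dropped packets:", round(cost, 3))
```

This sketch runs in time proportional to the number of packets times the budget, which differs from the complexity reported for the paper's own solution. With these sample inputs, the two packets from high-weight tiles are kept and the two low-weight packets are dropped.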

Paper Structure (21 sections, 23 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Tile-weighted rate-distortion packet scheduling optimization system for VR video streaming. Each packet has a color indicating its importance, and a width indicating its size. Due to bandwidth restrictions, some packets are dropped according to the proposed method. The visual quality of the reconstructed video mainly depends on the type of the dropped packet.
  • Figure 2: The architecture of the multimodal spatial-temporal attention transformer model, which predicts the viewpoint using a classification method.
  • Figure 3: Experiment results for the five methods under different bandwidth scenarios. Constant bandwidth scenario: (a) Total distortion. (b) Viewport distortion. (c) Total packet loss. (d) Viewport packet loss. (e) Bandwidth consumption. Real-world trace scenario: (f) Experiment results on the five metrics under the 4G LTE dataset.