Interactive $360^{\circ}$ Video Streaming Using FoV-Adaptive Coding with Temporal Prediction

Yixiang Mao; Liyang Sun; Yong Liu; Yao Wang

TL;DR

This work develops a low-latency FoV-adaptive coding and streaming system for interactive applications that is robust to bandwidth variations and FoV prediction errors, and develops LSTM-based machine learning models to predict the user's FoV and network bandwidth.

Abstract

For $360^{\circ}$ video streaming, FoV-adaptive coding that allocates more bits to the user's predicted field of view (FoV) is an effective way to maximize the rendered video quality under limited bandwidth. We develop a low-latency FoV-adaptive coding and streaming system for interactive applications that is robust to bandwidth variations and FoV prediction errors. To minimize the end-to-end delay while maximizing the coding efficiency, we propose a frame-level FoV-adaptive inter-coding structure. In each frame, regions that are in or near the predicted FoV are coded using temporal and spatial prediction, while a small rotating region is coded with spatial prediction only. This rotating intra region periodically refreshes the entire frame, thereby providing robustness to both FoV prediction errors and frame losses due to transmission errors. The system adapts the sizes and rates of different regions for each video segment to maximize the rendered video quality under the predicted bandwidth constraint. Integrating such frame-level FoV adaptation with temporal prediction is challenging due to the temporal variations of the FoV. We propose novel ways of modeling the influence of FoV dynamics on the quality-rate performance of temporal predictive coding. We further develop LSTM-based machine learning models to predict the user's FoV and network bandwidth. The proposed system is compared with three benchmark systems, using real-world network bandwidth traces and FoV traces, and is shown to significantly improve the rendered video quality, while achieving very low end-to-end delay and low frame-freeze probability.
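The rotating intra region described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the region's shape or schedule, so the following assumes, purely for illustration, that the frame is split into `num_cols` tile columns and that a window of `ri_width` adjacent columns is intra-coded in each frame, advancing by its own width every frame. The function names and parameters are hypothetical and are not taken from the paper.

```python
from typing import List


def rotating_intra_columns(frame_idx: int, num_cols: int, ri_width: int) -> List[int]:
    """Tile columns intra-coded in frame `frame_idx` (assumed column-wise rotation).

    The window advances by `ri_width` columns per frame and wraps around the
    frame width, which matches the horizontal wrap-around of an ERP frame.
    """
    start = (frame_idx * ri_width) % num_cols
    return [(start + k) % num_cols for k in range(ri_width)]


def frames_until_full_refresh(start_frame: int, num_cols: int, ri_width: int) -> int:
    """Count frames needed, starting at `start_frame`, until every column
    has been intra-coded at least once."""
    covered = set()
    n = 0
    while len(covered) < num_cols:
        covered.update(rotating_intra_columns(start_frame + n, num_cols, ri_width))
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical layout: 12 tile columns, 3 columns refreshed per frame.
    for f in range(5):
        print(f, rotating_intra_columns(f, num_cols=12, ri_width=3))
    print("frames to refresh whole frame:", frames_until_full_refresh(0, 12, 3))
```

Under this assumed schedule, a lost frame or a mispredicted FoV can affect any given tile for at most `ceil(num_cols / ri_width)` frames before that tile is re-coded without temporal prediction. A wider window shortens this recovery time at the cost of spending more bits on intra coding per frame.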

Paper Structure (31 sections, 19 equations, 11 figures, 1 table)

Introduction
Related Works
Proposed FoV-Adaptive Coding Scheme
Tile-based Frame Partitioning and Rate Adaptation Based on Predicted FoV
Optimization of Tile Size
Rate-Distortion Modeling and Adaptation of Region Size and Rate
Objective quality metric
"Ideal" Quality-Rate Models For Different Coded Regions
Quality-Rate Function for the Predicted FoV region
Quality-Rate Functions for the PF+ Region
Quality-Rate Functions of the RI Region
Rate-Increase Factor
Adjust Quality-Rate Functions for PF and PF+ regions
Quality-Decay Factor
Optimizing Rate Allocation and Region Sizes
...and 16 more sections

Figures (11)

Figure 1: Variable time lapses between the coded tiles inside the PF and PF+ regions. The frame on the right is the current frame; its previous frames are to its left. The square region enclosed by a solid-line border in each frame indicates the coded region in that frame. Different tiles in the coded region of the current frame have different time lapses to the latest frame in which the corresponding tiles were coded.
Figure 2: The tiled ERP frame and different coding regions. Dark grey: tiles to cover the PF region, coded at the rate $R_e$. Light grey and orange: tiles to cover PF+ and RI, coded at the rate $R_b$. Green: user's actual FoV, which may intersect with PF, PF+, RI, and un-coded tiles.
Figure 3: Tiles needed to cover the same FoV. The grey area indicates the FoV, which in this example covers 90 degrees. White and grey areas together indicate all the tiles that are needed to cover the FoV.
Figure 4: Total bit consumption under the same QP inside the FoV for 5 different tile sizes for 8K video. The horizontal axis indicates the number of pixels on each side of the square tile. All the tiles are coded in the inter-coding mode except in the first frame. A constant QP=30 was used. The resulting WS-PSNR values in the FoV region are reported in the figure legends. Bars of different colors are results for FoVs in different directions. The results are for the sequence "Trolley"; similar trends are observed for "Chairlift".
Figure 5: Q-R models for "Trolley" (a)-(d) and "Chairlift" (e)-(h). (a)(e): WS-PSNR vs. normalized rate for the PF regions of six viewing orientations, and the averaged WS-PSNR vs. normalized rate. (b)(f): WS-PSNR vs. normalized rate for the PF+ regions when their size is $10^{\circ}$. (c)(g): the averaged WS-PSNR vs. normalized rate for different PF+ region sizes. (d)(h): WS-PSNR vs. normalized rate for the RI region.
...and 6 more figures
