
Spatial Visibility and Temporal Dynamics: Revolutionizing Field of View Prediction in Adaptive Point Cloud Video Streaming

Chen Li, Tongyu Zong, Yueyu Hu, Yao Wang, Yong Liu

TL;DR

CellSight significantly improves long-term cell visibility prediction, reducing the prediction Mean Squared Error (MSE) loss by up to 50% compared to state-of-the-art models when predicting 2 to 5 seconds ahead, while maintaining real-time performance for point cloud videos with over 1 million points.

Abstract

Field-of-View (FoV) adaptive streaming significantly reduces the bandwidth requirement of immersive point cloud video (PCV) by transmitting only the points visible in a viewer's FoV. Traditional approaches often focus on trajectory-based 6 degree-of-freedom (6DoF) FoV prediction; the predicted FoV is then used to calculate point visibility. Such approaches do not explicitly consider the impact of video content on viewer attention, and the conversion from FoV to point visibility is often error-prone and time-consuming. We reformulate the PCV FoV prediction problem from the cell visibility perspective, which allows precise cell-level decisions about which 3D data to transmit based on the predicted visibility distribution. We develop a novel spatial visibility and object-aware graph model that leverages historical 3D visibility data and incorporates spatial perception, neighboring cell correlation, and occlusion information to predict future cell visibility. Our model significantly improves long-term cell visibility prediction, reducing the prediction MSE loss by up to 50% compared to state-of-the-art models while maintaining real-time performance (more than 30 fps) for point cloud videos with over 1 million points.

Paper Structure (20 sections, 17 equations, 9 figures, 2 tables)

This paper contains 20 sections, 17 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: 6DoF FoV Demonstration. As shown in Fig. 1(a), content inside the pyramid bounded by the near and far planes lies within the viewport, and content outside the pyramid cannot be seen by the viewer. Moreover, points inside the viewport are still invisible if they are occluded by other points. For a viewer watching the point cloud from the side as in Fig. 1(a) (the actual view is shown in Fig. 1(b)), only the highlighted part in Fig. 1(a) is visible. If the point cloud is divided into 3D cells as in Fig. 1(c), only the cells covering visible points need to be transmitted.
  • Figure 2: Overview of our cell visibility prediction system. The curve at the top is the viewer's 6DoF viewport trajectory, illustrating one of the 6DoF coordinates $(x,y,z,\psi,\theta, \phi)$ at each frame time $t$ for a point cloud sequence set $\mathcal{P}^{h+f}$. We partition the whole space into 3D cells and model the cells as a 3D grid-like graph. For each frame $t$, based on the 6DoF coordinates and $P^i$, we calculate, for each cell $i$, the total number of points $O_i$, the visible points $V_i$, and the percentage of the cell volume within the viewport $F_i$ (called the cell-based viewport feature). We use $G_i=[O_i, F_i, V_i, E_i]$ to represent the node feature for each frame, where $E_i$ denotes other features such as the coordinates of each cell. We use a temporal bidirectional GRU model and a spatial transformer-based graph model to capture patterns in viewer attention and cell visibility. The GRU model captures each node's temporal pattern, which is encoded into the hidden state $S_h$. In the graph model, each node aggregates its neighbors' information and the hidden state from the GRU, as the dashed line shows. To simplify the figure, we only show the graph attention update on $G_1$. After obtaining $S_{h+1}$ and $S^{'}_{0}$ from the bidirectional GRU, an MLP model predicts the cell visibility at the target timestamp $t^{h+f}$. Before the final output, $O^{h+f}$ is applied as a mask to get the predicted visibility $\hat{V}^{h+f}$, since in a streaming system the server has the point cloud at frame $h+f$. We can optionally also predict the overlap ratios between the viewport and the cells, $\hat{F}^{h+f}$ (which essentially predicts the viewport). $\hat{Y}^{h+f}$ is the prediction output, denoting either $\hat{V}^{h+f}$ or $\hat{F}^{h+f}$.
  • Figure 3: (a) is an example frame of a point cloud video. (b) is the graph we build for the full-scene video, in which each grid cell is a node. Since the object in the video is moving, we voxelize the whole space into grids.
  • Figure 4: In this illustration, given the viewer's 6DoF pose and the point cloud frame partitioned into 3D cells, we obtain the occupancy, viewport, and visibility features, denoted $o_i, f_i, v_i$. There are 8 nodes in total; the 5 colored nodes are occupied by points, each with a different number of points. $n_6$ has 10 points in total, but some are occluded by other points, so its visibility is 0.4. For nodes without points, the visibility feature is set to 0.
  • Figure 5: $R^2$ Score of Cell Visibility Prediction at Different Prediction Horizons.
  • ...and 4 more figures
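
The per-cell features described in the captions of Figures 2 and 4 (occupancy, viewport overlap, visibility) and the occupancy mask from Figure 2 can be illustrated with a small sketch. This is not the authors' implementation: all function names and parameters below are hypothetical, the viewer is assumed to sit at the origin looking along +z (so the rotational part of the 6DoF pose is ignored), the frustum is a symmetric square pyramid, per-point visibility flags are assumed to come from an external occlusion test, and visibility is taken as the visible fraction of a cell's points, following the 0.4 example in Figure 4.

```python
import math
from collections import defaultdict


def cell_index(point, cell_size):
    """Map a 3D point to the integer index of the cubic cell containing it."""
    return tuple(int(math.floor(c / cell_size)) for c in point)


def in_frustum(p, half_fov_deg, near, far):
    """Test a point against a symmetric square-pyramid frustum.

    Assumption: viewer at the origin looking along +z (no rotation).
    """
    x, y, z = p
    if not (near <= z <= far):
        return False
    limit = z * math.tan(math.radians(half_fov_deg))
    return abs(x) <= limit and abs(y) <= limit


def viewport_fraction(idx, cell_size, half_fov_deg, near, far, n=4):
    """Estimate the share of a cell's volume inside the frustum by testing
    the centres of an n x n x n grid of sub-voxels."""
    inside = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                p = tuple((idx[a] + (s + 0.5) / n) * cell_size
                          for a, s in enumerate((i, j, k)))
                if in_frustum(p, half_fov_deg, near, far):
                    inside += 1
    return inside / n ** 3


def cell_features(points, visible, cell_size, half_fov_deg, near, far):
    """Return {cell index: (occupancy, viewport fraction, visibility)}.

    `visible` holds one boolean per point; computing it requires an
    occlusion test that is outside the scope of this sketch.
    """
    occupancy = defaultdict(int)
    seen = defaultdict(int)
    for p, v in zip(points, visible):
        idx = cell_index(p, cell_size)
        occupancy[idx] += 1
        if v:
            seen[idx] += 1
    return {
        idx: (count,
              viewport_fraction(idx, cell_size, half_fov_deg, near, far),
              seen[idx] / count)
        for idx, count in occupancy.items()
    }


def mask_prediction(predicted, target_occupancy):
    """Zero the predicted visibility of cells that hold no points in the
    target frame, mirroring the occupancy mask described for Figure 2."""
    return {idx: (v if target_occupancy.get(idx, 0) > 0 else 0.0)
            for idx, v in predicted.items()}
```

Under these assumptions, a cell holding ten points of which four are flagged visible gets a visibility of 0.4, matching the $n_6$ example in Figure 4. Only occupied cells appear in the result of `cell_features`; empty cells are implicitly treated as having visibility 0.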