Asynchronous Feedback Network for Perceptual Point Cloud Quality Assessment

Yujie Zhang, Qi Yang, Ziyu Shan, Yiling Xu

TL;DR

This work addresses no-reference point cloud quality assessment (NR-PCQA) by introducing AFQ-Net, which mimics human visual processing with a dual-branch architecture that enables asynchronous global-to-local guidance. The global branch uses a Vision Transformer to extract attention maps from multi-view texture and depth projections; the per-view features are then combined by occupancy-weighted fusion to form a global feature $f_g$. The attention maps guide a region-aware local feature extractor that applies dynamic convolution with region-specific masks $M$ and filters $W$, yielding a local feature $f_l$. A coarse-to-fine regression then combines $f_g$ and $f_l$ through two heads, trained with losses including $L_{reg}$, $L_{dis}$, and $L_{rank}$ to promote progressive refinement. Experiments on three PCQA datasets show AFQ-Net achieving state-of-the-art correlations with subjective MOS, along with robustness across distortion types and in cross-dataset settings, which supports its use for NR-PCQA in practical pipelines such as compression.
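The occupancy-weighted fusion mentioned above can be illustrated with a minimal NumPy sketch. The summary does not give the exact weighting formula, so this sketch assumes that each view is weighted by its share of occupied (non-background) pixels; the function name `occupancy_weighted_fusion` and its arguments are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def occupancy_weighted_fusion(view_features, occupancy_masks):
    """Fuse per-view feature vectors into a single global feature.

    view_features:   (V, D) array, one D-dimensional feature per projected view.
    occupancy_masks: (V, H, W) boolean array, True where a pixel is covered by
                     a projected point and False for empty background.

    ASSUMPTION: a view's weight is its occupied-pixel count divided by the
    total occupied-pixel count over all views. The paper may define the
    weights differently.
    """
    view_features = np.asarray(view_features, dtype=np.float64)
    occupancy = np.asarray(occupancy_masks, dtype=bool)

    # Count occupied pixels in each view.
    counts = occupancy.reshape(occupancy.shape[0], -1).sum(axis=1).astype(np.float64)
    total = counts.sum()

    if total == 0:
        # Degenerate case (all views empty): fall back to uniform weights.
        weights = np.full(len(counts), 1.0 / len(counts))
    else:
        weights = counts / total

    # Weighted sum over views gives the fused feature f_g of shape (D,).
    return weights @ view_features, weights


if __name__ == "__main__":
    feats = np.array([[1.0, 0.0, 2.0], [5.0, 4.0, 2.0]])
    masks = np.array([[[1, 1], [1, 0]], [[1, 0], [0, 0]]], dtype=bool)
    f_g, w = occupancy_weighted_fusion(feats, masks)
    print("weights:", w)
    print("f_g:", f_g)
```

Under this assumption, views that show more of the object contribute more to $f_g$, while nearly empty projections are down-weighted.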

Abstract

Recent years have witnessed the success of deep learning-based techniques in no-reference point cloud quality assessment (NR-PCQA). To predict quality more accurately, many previous studies have attempted to capture global and local features in a bottom-up manner, but have ignored the interaction and mutual promotion between them. To solve this problem, we propose a novel asynchronous feedback quality prediction network (AFQ-Net). Motivated by human visual perception mechanisms, AFQ-Net employs a dual-branch structure to handle global and local features, simulating the left and right hemispheres of the human brain, and constructs a feedback module between them. Specifically, the input point clouds are first fed into a transformer-based global encoder to generate attention maps that highlight semantically rich regions; these maps are then merged into the global feature. Then, we use the generated attention maps to perform dynamic convolution for different semantic regions and obtain the local feature. Finally, a coarse-to-fine strategy is adopted to merge the two features into the final quality score. We conduct comprehensive experiments on three datasets and achieve superior performance over state-of-the-art approaches on all of them. The code will be available at https://github.com/zhangyujie-1998/AFQ-Net.

Paper Structure (14 sections, 12 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: The motivation of this paper. The core idea is that the human brain can initially identify various semantic regions (e.g., head, trunks, or background) and then perform fine-grained observation for different regions. Therefore, unlike previous studies that extract features in a bottom-up manner, our method applies a dual-branch architecture with a feedback connection. (a) Feature learning paradigm in previous works. (b) Proposed feature learning strategy.
  • Figure 2: The proposed AFQ-Net framework, which includes two branches and a feedback module, where $\oplus$ denotes the concatenation operation. We first divide images into patches and feed them into a transformer encoder, and the attention maps of the class token are merged into the global feature. Then, we utilize the attention maps to generate a guided mask and multiple groups of filters, followed by performing region-aware convolution to derive the local feature. Finally, we employ a coarse-to-fine quality prediction based on the two extracted features.
  • Figure 3: Visualization of the attention maps corresponding to the class token and the generated mask. (a) Original projected texture images. (b) Attention maps corresponding to the class token. (c) Guided masks that divide the images into multiple semantic regions. Different colors in (b) indicate the magnitude of the element values in the attention map; the brighter the color, the larger the value. In (c), colors are used to distinguish different regions, and pixels of the same color are considered to have the same semantics.
  • Figure 4: The specific architecture of the feedback module and the local branch. (a) Feedback module. (b) Local branch.
  • Figure 5: Visualization of local feature maps.
  • ...and 5 more figures
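
The guided mask and region-aware convolution described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the mask is obtained here by splitting the attention map at quantile thresholds, and the filters are supplied directly instead of being generated by a learned feedback module. The names `guided_mask` and `region_aware_conv` are hypothetical.

```python
import numpy as np

def guided_mask(attention, num_regions):
    """Assign every pixel a region label in 0..num_regions-1.

    ASSUMPTION: regions are formed by thresholding the attention map at
    evenly spaced quantiles. The paper may derive the mask differently.
    """
    attention = np.asarray(attention, dtype=np.float64)
    if num_regions == 1:
        return np.zeros(attention.shape, dtype=np.int64)
    # Interior quantile levels, e.g. [0.5] for two regions.
    levels = np.linspace(0.0, 1.0, num_regions + 1)[1:-1]
    thresholds = np.quantile(attention, levels)
    return np.digitize(attention, thresholds)

def region_aware_conv(image, mask, filters):
    """Filter each pixel with the kernel that belongs to its region.

    image:   (H, W) single-channel feature map.
    mask:    (H, W) integer region labels.
    filters: (K, k, k) array with one odd-sized square kernel per region.

    Kernels are applied without flipping (cross-correlation), as is usual
    in deep learning frameworks, with zero padding at the borders.
    """
    image = np.asarray(image, dtype=np.float64)
    filters = np.asarray(filters, dtype=np.float64)
    k = filters.shape[-1]
    padded = np.pad(image, k // 2)
    # windows[i, j] is the k x k neighbourhood centred on pixel (i, j).
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    # Pick the kernel for every pixel from its region label: (H, W, k, k).
    per_pixel_kernels = filters[mask]
    return (windows * per_pixel_kernels).sum(axis=(-2, -1))


if __name__ == "__main__":
    attention = np.arange(16, dtype=np.float64).reshape(4, 4)
    mask = guided_mask(attention, num_regions=2)
    identity = np.zeros((3, 3))
    identity[1, 1] = 1.0
    # Region 0 keeps the input; region 1 doubles it.
    out = region_aware_conv(np.ones((4, 4)), mask, np.stack([identity, 2 * identity]))
    print(mask)
    print(out)
```

The point of the sketch is that the kernel is selected per pixel by the mask, so regions the global branch treats as semantically different are filtered differently, which is the global-to-local feedback the captions describe.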