CLIP-PCQA: Exploring Subjective-Aligned Vision-Language Modeling for Point Cloud Quality Assessment

Yating Liu; Yujie Zhang; Ziyu Shan; Yiling Xu

TL;DR

CLIP-PCQA targets NR-PCQA by predicting an Opinion Score Distribution $\hat{P}$ over $K$ quality descriptions, instead of a single MOS. It projects 3D point clouds into $M$-view color and depth maps, extracts visual features with two fine-tuned CLIP encoders, and uses learnable prompts for $K$ textual descriptions to compute similarities. The predicted distribution is converted to a final score via $\hat{Q} = \sum_{k=1}^K \hat{p}_k q_k$ and trained with a combined loss $L = L_{emd} + \alpha L_{quan} + \beta L_{con}$. On SJTU-PCQA, LS-PCQA, and BASICS, it achieves state-of-the-art performance and demonstrates robust cross-database generalization, with visual analyses confirming alignment between predicted OS distributions and subjective judgments.
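The score conversion and the weighted loss combination described above can be illustrated with a minimal sketch. It assumes a hypothetical five-level scale ($q_k = 1, \dots, 5$) and made-up probabilities and loss values. The function names `expected_quality` and `combined_loss` are illustrative and are not taken from the paper's code. The individual loss terms are passed in as plain numbers because their definitions are not given here.

```python
def expected_quality(probs, levels):
    """Return sum_k p_k * q_k for a discrete opinion score distribution."""
    if len(probs) != len(levels):
        raise ValueError("probs and levels must have the same length")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must sum to 1")
    return sum(p * q for p, q in zip(probs, levels))


def combined_loss(l_emd, l_quan, l_con, alpha, beta):
    """Weighted sum L = L_emd + alpha * L_quan + beta * L_con."""
    return l_emd + alpha * l_quan + beta * l_con


# Hypothetical quality levels (assumption: K = 5, scores 1..5).
levels = [1, 2, 3, 4, 5]
# Hypothetical predicted distribution over the K quality descriptions.
probs = [0.05, 0.10, 0.20, 0.40, 0.25]

score = expected_quality(probs, levels)

# Hypothetical loss values and weights, chosen only to show the arithmetic.
total = combined_loss(0.2, 0.1, 0.5, alpha=0.5, beta=0.1)
```

With these illustrative numbers, most of the probability mass sits on levels 4 and 5, so the expected score lands between "fair" and "good" rather than at any single label.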

Abstract

In recent years, No-Reference Point Cloud Quality Assessment (NR-PCQA) research has achieved significant progress. However, existing methods mostly seek a direct mapping function from visual data to the Mean Opinion Score (MOS), which contradicts the mechanism of practical subjective evaluation. To address this, we propose a novel language-driven PCQA method named CLIP-PCQA. Considering that human beings prefer to describe visual quality using discrete quality descriptions (e.g., "excellent" and "poor") rather than specific scores, we adopt a retrieval-based mapping strategy to simulate the process of subjective assessment. More specifically, based on the philosophy of CLIP, we calculate the cosine similarity between the visual features and multiple textual features corresponding to different quality descriptions; in this process, an effective contrastive loss and learnable prompts are introduced to enhance the feature extraction. Meanwhile, given the personal limitations and bias in subjective experiments, we further convert the feature similarities into probabilities and consider the Opinion Score Distribution (OSD) rather than a single MOS as the final target. Experimental results show that our CLIP-PCQA outperforms other State-Of-The-Art (SOTA) approaches.
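The retrieval-based mapping can be sketched as follows: compute the cosine similarity between one visual feature and $K$ textual features, then turn the similarities into probabilities. The abstract does not state how the conversion is done, so the temperature-scaled softmax below is an assumption borrowed from common CLIP practice. The temperature value, the toy 2-D features, and the function names are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarities_to_distribution(visual, text_features, temperature=0.07):
    """Map similarities to probabilities with a softmax (an assumption)."""
    sims = [cosine_similarity(visual, t) for t in text_features]
    logits = [s / temperature for s in sims]
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy features: one visual vector and K = 3 textual vectors (hypothetical).
visual = [1.0, 0.0]
text_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

osd = similarities_to_distribution(visual, text_features)
```

In this toy case the visual feature is aligned with the first textual feature, orthogonal to the second, and opposite to the third, so the resulting probabilities decrease in that order.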

Paper Structure (26 sections, 15 equations, 7 figures, 8 tables)

This paper contains 26 sections, 15 equations, 7 figures, 8 tables.

Introduction
Related Works
No-Reference Point Cloud Quality Assessment
Vision-Language Learning in Quality Assessment
Proposed Method
Problem Formulation
Preprocessing
Multi-Modal Feature Extraction
Visual Feature Extraction
Textual Feature Extraction
Vision-Language Alignment
Loss Function
Experiments
Databases and Evaluation Metrics
Implementation Details
...and 11 more sections

Figures (7)

Figure 1: The process of subjective experiment. Participants first undergo a training process to understand the voting standard, then they are able to convert the quality descriptions into quantitative values. Note that multiple subjects may give diverse scores, which forms the score distribution.
Figure 2: The proposed CLIP-PCQA framework, which includes two main parts: multi-modal feature extraction and vision-language alignment. We use the encoders in CLIP to extract features and then perform vision-language alignment using OSD.
Figure 3: Illustration of the visual encoder architecture.
Figure 4: Weight matrix visualization. Darker colors indicate higher values, which correspond to stronger correlations.
Figure 5: PCA of the visual features on complete LS-PCQA.
...and 2 more figures
