PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment

Shashank Gupta, Gregoire Phillips, Alan C. Bovik

TL;DR

PIT-QMM tackles no-reference point cloud quality assessment (NR-PCQA) by unifying point-cloud, image, and text modalities within a large multimodal model. It introduces a task-aware instruction-following dataset and a two-stage training pipeline with LoRA adapters to fuse local 3D patches, global 2D projections, and psychometric prompts, achieving state-of-the-art results with fewer training iterations. It also demonstrates distortion identification and localization, providing interpretable cues about which distortions degrade quality and where they occur. The approach yields strong cross-dataset generalization and efficient inference, offering a practical path toward interactive, explainable NR-PCQA for 3D assets.
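
As a rough illustration of the LoRA adapters mentioned above, the sketch below shows the general low-rank adaptation idea: a frozen weight matrix W is augmented with a trainable update (alpha / r) * B * A, where B starts at zero so the adapted layer initially matches the frozen one. This is not the authors' implementation. The class name `LoRALinear`, the rank, the scaling factor, and the toy weights are all assumptions chosen for illustration, and the code uses plain Python lists rather than a deep learning framework.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer with a low-rank adapter (illustrative only)."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, never updated
        # A: small random values; B: zeros, so the update is zero at the start.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank  # assumed scaling; the paper's values are not given here

    def forward(self, x):
        base = matvec(self.weight, x)                # W x
        update = matvec(self.B, matvec(self.A, x))   # B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]


layer = LoRALinear([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]])
print(layer.forward([1.0, 2.0, 3.0]))
```

Only A and B would be trained in such a setup, which keeps the number of trainable parameters small relative to the frozen base model.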

Abstract

Large Multimodal Models (LMMs) have recently enabled considerable advances in the realm of image and video quality assessment, but this progress has yet to be fully explored in the domain of 3D assets. We are interested in using these models to conduct No-Reference Point Cloud Quality Assessment (NR-PCQA), where the aim is to automatically evaluate the perceptual quality of a point cloud in the absence of a reference. We begin with the observation that different modalities of data (text descriptions, 2D projections, and 3D point cloud views) provide complementary information about point cloud quality. We then construct PIT-QMM, a novel LMM for NR-PCQA that is capable of consuming text, images, and point clouds end-to-end to predict quality scores. Extensive experimentation shows that our proposed method outperforms the state-of-the-art by significant margins on popular benchmarks with fewer training iterations. We also demonstrate that our framework enables distortion localization and identification, which paves a new way forward for model explainability and interactivity. Code and datasets are available at https://www.github.com/shngt/pit-qmm.

Paper Structure

This paper contains 29 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: An overview of the proposed Point-Image-Text Quality Multimodal Model (PIT-QMM). PIT-QMM takes a raw point cloud and extracts both 2D and 3D views. Rich feature representations of these views are encoded by pretrained foundation models. These representations are then passed into a large multimodal model along with a textual description of the task and experimental setup, which is trained to predict quality scores.
  • Figure 2: The same underlying point cloud can have highly different quality characteristics depending on rendering parameters and the radius of interaction, especially in the NR setting. Point cloud taken from LS-PCQA and rendered in MeshLab. Best viewed zoomed in.
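
The input preparation described in the Figure 1 caption (a raw point cloud turned into 2D and 3D views, paired with a text description of the task) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `project_to_grid`, `local_patches`, and `build_model_inputs` are hypothetical, the 2D view is a simple orthographic occupancy grid rather than a rendered image, and the 3D views are nearest-neighbor patches around randomly chosen seed points. The pretrained encoders and the multimodal model itself are omitted.

```python
import math
import random


def project_to_grid(points, size=8):
    """Orthographic projection onto the XY plane as a size x size occupancy grid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0  # avoid division by zero for flat clouds
    span_y = (max(ys) - min_y) or 1.0
    grid = [[0] * size for _ in range(size)]
    for x, y, _z in points:
        col = min(int((x - min_x) / span_x * size), size - 1)
        row = min(int((y - min_y) / span_y * size), size - 1)
        grid[row][col] = 1
    return grid


def local_patches(points, num_patches=4, patch_size=16, seed=0):
    """Pick random seed points and gather each one's nearest neighbors as a patch."""
    rng = random.Random(seed)
    centers = rng.sample(points, num_patches)
    return [
        sorted(points, key=lambda p: math.dist(p, center))[:patch_size]
        for center in centers
    ]


def build_model_inputs(points, prompt):
    """Bundle the three modalities that a downstream model would consume."""
    return {
        "image": project_to_grid(points),   # global 2D view
        "patches": local_patches(points),   # local 3D views
        "text": prompt,                     # task description
    }


rng = random.Random(1)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
inputs = build_model_inputs(cloud, "Rate the quality of this point cloud.")
print(len(inputs["image"]), len(inputs["patches"]), len(inputs["patches"][0]))
```

In a real system each of these inputs would pass through its own pretrained encoder before reaching the language model; the dictionary here only stands in for that hand-off.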