Rho-Perfect: Correlation Ceiling For Subjective Evaluation Datasets

Fredrik Cumlin

TL;DR

$\rho$-Perfect provides a principled upper bound on model-human correlation for subjectively rated data by decomposing outcome variance under heteroscedastic noise. It defines the ceiling as $\rho$-Perfect $= \sqrt{\frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}}$, where $\hat{Y}=\mathbb{E}[Y|X]$, and shows that the squared bound, $\rho$-Perfect$^2$, estimates the correlation between two independent subjective evaluations of the same items. The method is validated experimentally with Split-Raters and Split-Ratings across BVCC, MovieLens, SOMOS, and MERP, showing that $\mathbb{E}[\text{Cov}(Y_1,Y_2|X)]\approx0$ and that $\rho$-Perfect$^2$ tracks true reliability better than conventional ICC in unbalanced settings. A practical case on NISQA with DNSMOS Pro demonstrates that a high $\rho$-Perfect upper bound helps distinguish data reliability from model shortcomings and indicates where improvements are needed. The work provides a scalable, interpretable metric to contextualize model performance on subjective datasets and supports more nuanced evaluation in speech, aesthetics, and recommendation domains.
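The link between the squared ceiling and the correlation of two independent evaluations can be made explicit with the law of total covariance. The sketch below is an illustration rather than the paper's own derivation: it assumes two evaluations $Y_1$ and $Y_2$ of the same items that share the conditional mean $\hat{Y}=\mathbb{E}[Y_1|X]=\mathbb{E}[Y_2|X]$ and the same marginal variance $\text{Var}(Y)$.

```latex
\begin{align*}
\text{Cov}(Y_1, Y_2)
  &= \text{Cov}\big(\mathbb{E}[Y_1|X],\, \mathbb{E}[Y_2|X]\big)
     + \mathbb{E}[\text{Cov}(Y_1, Y_2|X)] \\
  &= \text{Var}(\hat{Y}) + \mathbb{E}[\text{Cov}(Y_1, Y_2|X)], \\[4pt]
\text{Corr}(Y_1, Y_2)
  &= \frac{\text{Cov}(Y_1, Y_2)}{\text{Var}(Y)}
   \approx \frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}
   = \rho\text{-Perfect}^2
   \quad \text{when } \mathbb{E}[\text{Cov}(Y_1, Y_2|X)] \approx 0.
\end{align*}
```

Under these assumptions, the empirical check that $\mathbb{E}[\text{Cov}(Y_1,Y_2|X)]\approx0$ is what allows a split-half correlation to serve as a reference value for $\rho$-Perfect$^2$.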

Abstract

Subjective ratings contain inherent noise that limits the model-human correlation, but this reliability issue is rarely quantified. In this paper, we present $\rho$-Perfect, a practical estimate of the highest achievable correlation of a model on subjectively rated datasets. We define $\rho$-Perfect to be the correlation between a perfect predictor and human ratings, and derive an estimate of the value based on heteroscedastic noise scenarios, a common occurrence in subjectively rated datasets. We show that $\rho$-Perfect squared estimates test-retest correlation and use this to validate the estimate. We demonstrate the use of $\rho$-Perfect on a speech quality dataset and show how the measure can distinguish between model limitations and data quality issues.
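To make the idea concrete, here is a minimal simulation sketch in Python. It is not the paper's estimator or experimental setup: the function names (`rho_perfect`, `simulate_panel`, `run_demo`), the plug-in estimate $\text{Var}(\hat{Y}) \approx \text{Var}(Y) - \text{mean}(s_i^2/n_i)$ taken from the law of total variance, and all simulation parameters are assumptions chosen for illustration. Ratings are drawn as unbounded Gaussians rather than on a discrete 1 to 5 scale to keep the example short.

```python
import numpy as np


def rho_perfect(ratings):
    """Plug-in estimate of the correlation ceiling (illustrative assumption).

    ratings: sequence of 1-D arrays, one per item, each with >= 2 ratings.
    """
    means = np.array([np.mean(r) for r in ratings])
    # Estimated noise variance of each item's mean rating: s_i^2 / n_i.
    noise = np.array([np.var(r, ddof=1) / len(r) for r in ratings])
    total_var = np.var(means, ddof=1)                # estimates Var(Y)
    signal_var = max(total_var - noise.mean(), 0.0)  # estimates Var(E[Y|X])
    return float(np.sqrt(signal_var / total_var))


def simulate_panel(true_score, sigma, n_raters, rng):
    """Draw one independent panel of ratings for every item."""
    return [rng.normal(m, s, size=n)
            for m, s, n in zip(true_score, sigma, n_raters)]


def run_demo(n_items=5000, seed=0):
    rng = np.random.default_rng(seed)
    true_score = rng.normal(3.0, 0.7, size=n_items)  # latent quality E[Y|X]
    sigma = rng.uniform(0.5, 1.5, size=n_items)      # heteroscedastic noise
    n_raters = rng.integers(4, 13, size=n_items)     # unbalanced: 4 to 12 raters

    # Two independent evaluations of the same items.
    panel_1 = simulate_panel(true_score, sigma, n_raters, rng)
    panel_2 = simulate_panel(true_score, sigma, n_raters, rng)
    y1 = np.array([p.mean() for p in panel_1])
    y2 = np.array([p.mean() for p in panel_2])

    rho = rho_perfect(panel_1)
    retest = float(np.corrcoef(y1, y2)[0, 1])          # test-retest correlation
    oracle = float(np.corrcoef(true_score, y1)[0, 1])  # perfect predictor vs. ratings
    return rho, retest, oracle


rho, retest, oracle = run_demo()
print(f"rho-Perfect estimate:        {rho:.3f}")
print(f"perfect-predictor corr:      {oracle:.3f}")
print(f"rho-Perfect squared:         {rho ** 2:.3f}")
print(f"test-retest correlation:     {retest:.3f}")
```

In this synthetic setting the estimated ceiling should sit close to the correlation achieved by the latent score itself, and its square close to the correlation between the two panels, which mirrors the validation strategy described in the abstract.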

Paper Structure

This paper contains 9 sections, 1 theorem, 13 equations, and 3 tables.

Key Result

Lemma 2.1

Let $X,Y$ be two random variables and $\hat{Y}=\mathbb{E}[Y\vert X]$. Then the correlation of $\hat{Y}$ and $Y$ is given by $\text{Corr}(\hat{Y},Y)=\sqrt{\frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}}$.
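One standard route to the closed form $\sqrt{\text{Var}(\hat{Y})/\text{Var}(Y)}$ goes through the tower property of conditional expectation. The steps below are a sketch under the assumption that $\text{Var}(Y)>0$ and $\text{Var}(\hat{Y})>0$, and are not quoted from the paper's proof.

```latex
\begin{align*}
\text{Cov}(\hat{Y}, Y)
  &= \mathbb{E}[\hat{Y} Y] - \mathbb{E}[\hat{Y}]\,\mathbb{E}[Y] \\
  &= \mathbb{E}\big[\hat{Y}\,\mathbb{E}[Y|X]\big] - \mathbb{E}[\hat{Y}]^2
     && \text{(tower property, } \mathbb{E}[\hat{Y}] = \mathbb{E}[Y]\text{)} \\
  &= \mathbb{E}[\hat{Y}^2] - \mathbb{E}[\hat{Y}]^2 = \text{Var}(\hat{Y}), \\[4pt]
\text{Corr}(\hat{Y}, Y)
  &= \frac{\text{Cov}(\hat{Y}, Y)}{\sqrt{\text{Var}(\hat{Y})\,\text{Var}(Y)}}
   = \frac{\text{Var}(\hat{Y})}{\sqrt{\text{Var}(\hat{Y})\,\text{Var}(Y)}}
   = \sqrt{\frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}}.
\end{align*}
```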

Theorems & Definitions (3)

  • Lemma 2.1
  • proof
  • Definition 2.1: $\rho$-Perfect