Robust Estimation of Polychoric Correlation

Max Welz, Patrick Mair, Andreas Alfons

TL;DR

This work addresses the vulnerability of polychoric correlation estimation to misspecification of latent normality by introducing a robust C-estimator that downweights poorly fitting contingency-table cells via a discrepancy function with a tuning constant. The method generalizes ML, retains full efficiency under correct specification, and remains consistent and asymptotically normal under partial misspecification, all at no additional computational cost. In simulations and an empirical Big Five application, the estimator is substantially more robust to careless responding than ML, and its Pearson residuals can reveal the sources of contamination. An open-source R package (robcat) implements the estimator for use in SEMs, factor analysis, and related multivariate techniques for ordinal data.

Abstract

Polychoric correlation is often an important building block in the analysis of rating data, particularly for structural equation models. However, the commonly employed maximum likelihood (ML) estimator is highly susceptible to misspecification of the polychoric correlation model, for instance through violations of latent normality assumptions. We propose a novel estimator that is designed to be robust against partial misspecification of the polychoric model, that is, when the model is misspecified for an unknown fraction of observations, such as careless respondents. To this end, the estimator minimizes a robust loss function based on the divergence between observed frequencies and theoretical frequencies implied by the polychoric model. In contrast to existing literature, our estimator makes no assumption on the type or degree of model misspecification. It furthermore generalizes ML estimation, is consistent as well as asymptotically normally distributed, and comes at no additional computational cost. We demonstrate the robustness and practical usefulness of our estimator in simulation studies and an empirical application on a Big Five administration. In the latter, the polychoric correlation estimates of our estimator and ML differ substantially, which, after further inspection, is likely due to the presence of careless respondents that the estimator helps identify.
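
To make the "theoretical frequencies implied by the polychoric model" concrete, the following is a minimal sketch (not the robcat implementation) of how the cell probabilities $p_{xy}$ arise when a standard bivariate normal pair $(\xi,\eta)$ with correlation $\rho$ is cut at thresholds. The function name `cell_probabilities`, the threshold values, and the midpoint quadrature are assumptions made for illustration only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def cell_probabilities(rho, thresholds_x, thresholds_y, grid=400):
    """Cell probabilities of a discretized standard bivariate normal.

    Uses P(a < xi <= b, c < eta <= d)
        = integral over (a, b] of pdf(xi) * [Phi((d - rho*xi)/s) - Phi((c - rho*xi)/s)],
    with s = sqrt(1 - rho^2), approximated by a midpoint rule.
    The xi-axis is clipped at +-8, where the normal density is negligible.
    """
    s = math.sqrt(1.0 - rho * rho)
    cuts_x = [-8.0] + list(thresholds_x) + [8.0]
    cuts_y = [-math.inf] + list(thresholds_y) + [math.inf]

    def cond_cdf(t, xi):
        # P(eta <= t | xi) for the conditional normal N(rho * xi, s^2)
        if math.isinf(t):
            return 0.0 if t < 0 else 1.0
        return STD_NORMAL.cdf((t - rho * xi) / s)

    probs = []
    for a, b in zip(cuts_x[:-1], cuts_x[1:]):
        h = (b - a) / grid
        row = [0.0] * (len(cuts_y) - 1)
        for i in range(grid):
            xi = a + (i + 0.5) * h           # midpoint of the i-th subinterval
            w = STD_NORMAL.pdf(xi) * h       # quadrature weight
            for j, (c, d) in enumerate(zip(cuts_y[:-1], cuts_y[1:])):
                row[j] += w * (cond_cdf(d, xi) - cond_cdf(c, xi))
        probs.append(row)
    return probs


# Hypothetical thresholds for five response options per item
p = cell_probabilities(0.5, [-1.5, -0.5, 0.5, 1.5], [-1.5, -0.5, 0.5, 1.5])
```

Here `p[x - 1][y - 1]` approximates the model probability of the response pair $(x, y)$; an estimator of the kind described above compares these values with the observed relative frequencies.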

Paper Structure

This paper contains 51 sections, 1 theorem, 62 equations, 21 figures, and 10 tables.

Key Result

Theorem 1

Let $c\in [0,\infty]$ be the tuning constant in the discrepancy function, and assume that $\frac{f_{\varepsilon} \left(x,y\right)}{p_{xy}(\bm{\theta}_0)}-1\neq c$ for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$. Then, under certain regularity conditions that do not restrict the degree or type of possible misspecification, [...] as well as [...] where, as a function of $\bm{\theta}\in\bm{\Theta}$, the estimator's invertible asympto[...]
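
The quantity $\frac{f(x,y)}{p_{xy}(\bm{\theta})}-1$ in the theorem is the Pearson residual of a cell, and $c$ is the point beyond which the discrepancy function departs from its ML form. The sketch below illustrates this under a stated assumption: the robust $\varphi$ is taken to equal $(z+1)\log(z+1)$ on $[-1, c]$ and to continue linearly (with matching value and slope) for $z > c$, and the loss is taken to be $\sum_{x,y} p_{xy}\,\varphi(z_{xy})$. Both forms are assumptions for illustration and should be checked against the paper's own equations; the function names are hypothetical.

```python
import math


def phi(z, c=0.6):
    """Discrepancy function: ML form on [-1, c], linear beyond c (assumed form)."""
    if z < -1.0:
        raise ValueError("Pearson residuals cannot fall below -1")
    if z == -1.0:
        return 0.0  # limit of (z + 1) * log(z + 1) as z -> -1
    if z <= c:
        return (z + 1.0) * math.log(z + 1.0)
    # Tangent-line continuation at z = c: same value and slope as the ML branch
    return (z + 1.0) * (math.log(c + 1.0) + 1.0) - c - 1.0


def pearson_residuals(freqs, probs):
    """Residual f / p - 1 per cell; large values flag cells the model underpredicts."""
    return [f / p - 1.0 for f, p in zip(freqs, probs)]


def robust_loss(freqs, probs, c=0.6):
    """Sum over cells of p * phi(f / p - 1); c = math.inf gives the ML-type loss."""
    residuals = pearson_residuals(freqs, probs)
    return sum(p * phi(z, c) for p, z in zip(probs, residuals))


# Toy example with four cells (made-up numbers, not from the paper)
observed = [0.1, 0.2, 0.3, 0.4]
model = [0.25, 0.25, 0.25, 0.25]
loss_robust = robust_loss(observed, model, c=0.6)
loss_ml = robust_loss(observed, model, c=math.inf)
```

With `c=math.inf` every residual falls on the ML branch and the loss reduces to $\sum f \log(f/p)$, the Kullback-Leibler divergence whose minimization is equivalent to ML. For a finite `c`, cells with residuals above `c` grow only linearly in the loss, which is one way to realize the downweighting of poorly fitting cells.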

Figures (21)

  • Figure 1: Simulated data with $K_X=K_Y=5$ response options where the polychoric model is misspecified with contamination fraction $\varepsilon=0.15$. The gray dots represent random draws of $(\xi,\eta)$ from the polychoric model with $\rho_*=0.5$, whereas the orange dots represent draws from a contamination distribution that primarily inflates the cell $(x,y)=(5,1)$. The contamination distribution is bivariate normal with mean $(2.5,-2.5)^\top$, variances $(0.25, 0.25)^\top$, and zero correlation. The blue lines indicate the locations of the thresholds. In each cell, the numbers in parentheses denote the population probability of that cell under the true polychoric model.
  • Figure 2: Visualization of the robust discrepancy function $\varphi(z)$ in \ref{eq:phifun} for $c = 0.6$ (solid line) and the ML discrepancy function $\varphi^{\mathrm{MLE}}(z) = (z+1) \log (z+1)$ (dotted line).
  • Figure 3: The population estimand $\rho_0$ of the polychoric correlation coefficient for various contamination fractions $\varepsilon$ ($x$-axis) and tuning constants $c$ (line colors), for the same contamination distribution as in Figure 1. The ML estimand corresponds to $c=+\infty$. There are $K_X=K_Y=5$ response options and the true value is $\rho_* = 0.5$ (dashed line).
  • Figure 4: Boxplot visualization of the bias of three estimators of the polychoric correlation coefficient, $\widehat{\rho}_N - \rho_*$, for various contamination fractions in the misspecified polychoric model across 5,000 repetitions. The estimators are the robust estimator with $c=0.6$ (left), the MLE (center), and the Pearson sample correlation (right). Diamonds represent the respective average bias. The dashed line marks the value 0 and the dotted line marks $-\rho_* = -0.5$; a bias below the latter indicates a sign flip in the correlation estimate.
  • Figure 5: Absolute average bias (top) and confidence interval coverage (bottom) at nominal level 95% (dashed horizontal lines) of the robust estimator with $c=0.6$ (left) and the MLE (right) for each unique pairwise polychoric correlation coefficient in the true correlation matrix (Table \ref{tab:cormat-sim}), expressed as a function of the contamination fraction $\varepsilon$ ($x$-axis). Results are aggregated over 5,000 repetitions.
  • ...and 16 more figures
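
The contamination setup described in the Figure 1 caption can be sketched as a small simulation. This is an illustrative reconstruction, not the authors' simulation code: the threshold values are not stated in the caption and are hypothetical here, as are the function name `simulate_ratings` and the seed. The values $\rho_*=0.5$, $\varepsilon=0.15$, the contamination mean $(2.5,-2.5)^\top$, and the variances $0.25$ (standard deviation $0.5$) are taken from the caption.

```python
import random
from bisect import bisect_left


def simulate_ratings(n, rho=0.5, eps=0.15,
                     thresholds=(-1.5, -0.5, 0.5, 1.5), seed=1):
    """Draw n rating pairs (x, y) and return a dict of cell counts.

    With probability eps a draw comes from the contamination distribution
    (independent normals with means 2.5 and -2.5, sd 0.5); otherwise from a
    standard bivariate normal with correlation rho. Latent values are then
    discretized at the (hypothetical) thresholds into categories 1..5.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        if rng.random() < eps:
            xi = rng.gauss(2.5, 0.5)
            eta = rng.gauss(-2.5, 0.5)
        else:
            xi = rng.gauss(0.0, 1.0)
            # eta = rho * xi + sqrt(1 - rho^2) * noise has unit variance
            eta = rho * xi + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        x = bisect_left(thresholds, xi) + 1   # number of thresholds below xi, plus 1
        y = bisect_left(thresholds, eta) + 1
        counts[(x, y)] = counts.get((x, y), 0) + 1
    return counts


contaminated = simulate_ratings(20000)
clean = simulate_ratings(20000, eps=0.0)
share_contaminated = contaminated.get((5, 1), 0) / 20000
share_clean = clean.get((5, 1), 0) / 20000
```

Comparing `share_contaminated` with `share_clean` shows how strongly the cell $(x,y)=(5,1)$ is inflated relative to what the polychoric model alone would produce, which is the kind of pattern a large Pearson residual would flag.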

Theorems & Definitions (1)

  • Theorem 1