On the approximation of the between-set correlation matrix by canonical correlation analysis

Jan Graffelman

Abstract

Canonical correlation analysis is a classic, well-known multivariate statistical method that focuses on the relationships between two sets of variables. Those relationships can be visualised by means of a biplot of the between-set correlation matrix. The canonical analysis provides a low-rank approximation to the between-set correlation matrix that is optimal in a generalised least squares sense. This article proposes to adjust the between-set correlation matrix using either a single scalar effect, or column and/or row effects. An alternating generalised least squares algorithm is proposed to obtain optimal adjustments and low-rank factorisations. The adjustment leads to a better approximation of the between-set correlation matrix, achieving a lower root mean squared error than the classic canonical analysis. The results of the adjusted analysis can be efficiently visualised using biplots, with a minimal change in interpretation rules that only affects the biplot origin. Biplot calibration is used to enhance the visualisation of the results of the adjusted analysis. Examples with publicly available data sets from social science, geochemistry and medical science illustrate the proposed improvement. Software for carrying out the adjusted canonical analysis in the R environment is provided.
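
The low-rank approximation mentioned in the abstract can be sketched with the standard textbook formulation of canonical correlation analysis. The notation below ($R_{xx}$, $R_{yy}$, $R_{xy}$, $A$, $B$, $D$, $k$, $\delta$, $F$, $G$) is assumed for illustration and is not taken from the paper; the final line is only one plausible reading of the "single scalar effect" adjustment, not the paper's stated model.

```latex
% Assumed notation (illustrative, not the paper's):
%   R_{xx} (p x p), R_{yy} (q x q): within-set correlation matrices, positive definite
%   R_{xy} (p x q): between-set correlation matrix
%   D: diagonal matrix of canonical correlations; A, B: canonical coefficients
%   Subscript (k): first k columns (or leading k x k block), with k <= min(p, q)
\begin{align*}
R_{xx}^{-1/2} R_{xy} R_{yy}^{-1/2} &= U D V^{\top},
  \qquad A = R_{xx}^{-1/2} U, \quad B = R_{yy}^{-1/2} V, \\
R_{xy} &= R_{xx} A D B^{\top} R_{yy}
  && \text{(exact, full rank)} \\
\hat{R}_{xy}^{(k)} &= R_{xx} A_{(k)} D_{(k)} B_{(k)}^{\top} R_{yy}
  && \text{(rank-$k$ approximation)} \\
\hat{R}_{xy}^{(k)} &\in \arg\min_{\operatorname{rank}(M) \le k}
  \operatorname{tr}\!\left[ R_{xx}^{-1} (R_{xy} - M)\, R_{yy}^{-1} (R_{xy} - M)^{\top} \right]
  && \text{(generalised least squares)} \\
R_{xy} &\approx \delta\, \mathbf{1}_p \mathbf{1}_q^{\top} + F G^{\top}
  && \text{(hypothetical form of a scalar adjustment)}
\end{align*}
```

The fourth line follows from the Eckart-Young theorem applied to $R_{xx}^{-1/2} R_{xy} R_{yy}^{-1/2}$: the trace criterion equals the squared Frobenius norm of the transformed residual, so truncating the singular value decomposition gives the minimiser.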

Paper Structure

This paper contains 10 sections, 11 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Biplots of the between-set correlation matrix of the psychology-achievement data obtained by CCA (A and B) and CCA-$\delta$ (C and D). Biplot vectors for rows in red, for columns in blue. In panels B and D, increments of 0.01, 0.05 and 0.1 on the correlation scale are marked with grey, dark-grey and (larger) black dots, respectively. The negative part of the correlation scale is in red, the positive part in blue. In panels B and D, green perpendiculars illustrate the approximation of the correlations of the psychological variables with science.
  • Figure 2: Biplots of the between-set correlation matrix of the sandstone oil data. A: one-dimensional biplot obtained by CCA. B: two-dimensional biplot obtained by CCA. Three dots represent the scores of the canonical $Y$ variates (blue = Upper, red = Wilhelm, green = Sub-Mulinia). C: one-dimensional biplot obtained by CCA-$c$. D: two-dimensional biplot obtained by CCA-$c$. Biplot vectors for rows in red, for columns in blue. The RMSE is given in parentheses in the title of each panel. Sandstone vectors are calibrated with dots: black (larger), dark-grey and light-grey dots represent increments of 0.1, 0.05 and 0.01 on the correlation scale, respectively. Green perpendiculars illustrate the approximation of the between-set correlations.
  • Figure 3: Two-dimensional biplots of the between-set correlation matrix of the cardiovascular dataset. A: standard CCA biplot. B: CCA-$c$ biplot with calibration of age and smoke. C: standard CCA biplot with over-plotted canonical variates (individuals). D: CCA-$c$ biplot with over-plotted adjusted canonical variates. Biplot vectors for rows in red, for columns in blue. Low-risk individuals are coloured green, intermediate/high-risk individuals red. The RMSE is given in parentheses in the title of each panel.