A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics

Cynthia A. Huang

TL;DR

The Crossmaps Framework defines a new approach for transforming existing variables, collected under a specific measurement or classification standard, into an imputed counterfactual variable indexed by some target standard. It is introduced through the example of ex-post harmonisation of aggregated statistics in the social sciences.

Abstract

Ex-post harmonisation is one of many data preprocessing tasks used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables, collected under a specific measurement or classification standard, into an imputed counterfactual variable indexed by some target standard. It uses computational graphs to separate intended transformation logic from actual data transformations, and so avoids the risk of syntactically valid data manipulation scripts producing statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps, and discuss the resulting implications for understanding the statistical properties of ex-post harmonisation and for designing error-minimising workflows.

Paper Structure (47 sections, 3 theorems, 8 figures, 3 tables)

This paper contains 47 sections, 3 theorems, 8 figures, 3 tables.

Key Result

Corollary 3.1

For any valid crossmap transform that applies a crossmap $\mathcal{X}$ to a shared mass array $A_{[\mathcal{K},\mathbf{x}]}$, resulting in $A_{[\mathcal{T},\mathbf{y}]}$, numeric mass is preserved through the operation such that $\sum_{k=1}^{T} y_k = \sum_{i=1}^{K} x_i$.
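
To make the corollary concrete, the following minimal Python sketch applies a crossmap in list (edge) encoding to a small shared mass array and checks that the totals agree. The keys, weights, values and the function name `apply_crossmap` are illustrative assumptions and are not taken from the paper. The sketch also assumes that a valid crossmap is one whose outgoing weights sum to one for every source key.

```python
from collections import defaultdict
from math import isclose


def apply_crossmap(edges, source):
    """Redistribute source values to target keys along weighted edges.

    edges  : list of (source_key, target_key, weight) triples (list encoding)
    source : dict mapping source_key -> numeric value
    """
    # Assumed validity condition: weights leaving each source key sum to 1.
    out_weight = defaultdict(float)
    for src, _tgt, w in edges:
        out_weight[src] += w
    for key in source:
        if not isclose(out_weight.get(key, 0.0), 1.0):
            raise ValueError(f"weights for source key {key!r} do not sum to 1")

    # Each target value is the weighted sum of the source values mapped to it.
    target = defaultdict(float)
    for src, tgt, w in edges:
        if src in source:
            target[tgt] += w * source[src]
    return dict(target)


# Hypothetical example: three source keys totalling 140 units.
edges = [
    ("circle", "A", 1.0),    # one-to-one
    ("square", "A", 0.5),    # one-to-many split
    ("square", "B", 0.5),
    ("triangle", "C", 1.0),  # one-to-one
]
x = {"circle": 40.0, "square": 60.0, "triangle": 40.0}
y = apply_crossmap(edges, x)

# Mass preservation: the target total equals the source total.
assert isclose(sum(y.values()), sum(x.values()))
```

Checking the weights before transforming is a design choice in this sketch: it turns a silent loss or duplication of mass into an explicit error.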

Figures (8)

  • Figure 1: Decomposition of an Ex-Post Harmonisation Process for combining two source observations collected using different classifications. The source observation for USA is already in the target classification, represented by the letter index and green shading. However, the observation for AUS, totalling 140 units, was collected in an alternative "source" classification, represented by the shape index and blue shading. Thus, in addition to any necessary source-specific cleaning steps, the AUS observation also requires a Crossmap Transform into the target "green-letter" index.
  • Figure 2: Conceptual illustration of the Crossmaps Framework using the same harmonisation shown in Figure 1. The example shared mass array data inputs and outputs of a crossmap transform are shown on either side of the crossmap input, which specifies the mapping between source and target keys. The equivalent graph, matrix and list encodings of the crossmap are all illustrated.
  • Figure 3: Graph and List representations of a crossmap based on a subset of the crosswalk between the 2022 update of the Australian and New Zealand Standard Classification of Occupations (ANZSCO22) and the fourth iteration of the International Standard Classification of Occupations (ISCO08), published by the Australian Bureau of Statistics (Australian Bureau of Statistics, 2022).
  • Figure 4: Summary visualisation of a set of concurrent crossmap transforms applied to industry-level output statistics collected according to country-year specific industry codes. Each tile represents a country-year observation of output (GDP) production in the INDSTAT4 Revision 3 Industry Level Dataset. The colour of the tile indicates whether that country-year observation contained industry codes and associated output values that were redistributed to the codes in the target ISIC classification.
  • Figure 5: Stylised example of a data leakage error. The crossmap shown on the left-hand side does not contain mapping instructions for the source key x7285!. Thus, under a naive transformation the associated value 3895 could be lost.
  • ...and 3 more figures
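
The data leakage error described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the keys `s1`, `s2`, `s3`, the weights, and the helper names `naive_transform` and `unmapped_source_keys` are hypothetical, and only the value 3895 comes from the caption. The naive transform behaves like an inner join, so a source key with no mapping instructions drops out without any error.

```python
def unmapped_source_keys(edges, source):
    """Return source keys that carry data but have no outgoing crossmap edge."""
    mapped = {src for src, _tgt, _w in edges}
    return sorted(k for k in source if k not in mapped)


def naive_transform(edges, source):
    """Inner-join style transform: silently skips source keys without edges."""
    target = {}
    for src, tgt, w in edges:
        if src in source:
            target[tgt] = target.get(tgt, 0.0) + w * source[src]
    return target


# Hypothetical crossmap (list encoding) with no instructions for key "s3".
edges = [("s1", "t1", 1.0), ("s2", "t1", 0.5), ("s2", "t2", 0.5)]
source = {"s1": 1000.0, "s2": 500.0, "s3": 3895.0}

# Mass that disappears under the naive transform.
leaked = sum(source.values()) - sum(naive_transform(edges, source).values())

# Coverage check that flags the problem before any data is transformed.
missing = unmapped_source_keys(edges, source)
```

Running the coverage check before the transform reports `s3` as unmapped, whereas the naive transform alone would simply return totals that are short by the leaked amount.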

Theorems & Definitions (12)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Corollary 3.1
  • proof
  • Definition 4.1
  • Definition 4.2
  • Corollary 4.1
  • proof
  • Proposition 4.1
  • ...and 2 more