
Loglinear modelling of huge contingency tables

Veronica Vinciotti, Ernst C. Wit

TL;DR

An efficient method for inferring higher-order loglinear models for huge, sparse contingency tables, in which most cells are empty. It uses only a sample of the empty cells and derives the associated likelihood under a Poisson sampling scheme.

Abstract

Contingency tables are a fundamental representation of multivariate categorical data. As the size of the contingency table grows exponentially with the number of variables, even a moderate number of variables, each with a moderate number of levels, will result in a huge number of cells, the majority of which will remain empty even with a significant amount of data. We propose an efficient method for inferring higher-order loglinear models in such scenarios. We tackle the computational challenge by using only a sample of the empty cells and deriving the associated likelihood under a Poisson sampling scheme. This allows us to define an iteratively re-weighted least squares (IRWLS) algorithm for parameter estimation. Under the extreme setting of huge contingency tables, we show how standard Poisson regression on the sampled data converges to this IRWLS scheme when the number of sampled empty cells exceeds the number of observations. We illustrate the method with an analysis of data from the General Social Survey, which consists of 15014 observations in a 70-dimensional contingency table with a total of $2.6 \times 10^{39}$ cells.
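
The idea of fitting a Poisson loglinear model from the non-empty cells plus a random sample of empty cells can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the likelihood derived in the paper: it weights each sampled empty cell by the inverse of its sampling fraction (a standard inverse-probability device chosen here for simplicity), restricts itself to binary variables and main effects, and uses hypothetical helper names (`poisson_irwls`, `sample_empty_cells`, `fit_binary_main_effects`, `demo`). The full table is never enumerated.

```python
import numpy as np

def poisson_irwls(X, y, weights=None, n_iter=100, tol=1e-8):
    """Log-link Poisson regression fitted by iteratively re-weighted least squares."""
    w_obs = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(np.average(y, weights=w_obs))  # start from the null model
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu      # working response
        W = w_obs * mu               # working weights
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def sample_empty_cells(cells, n0, rng):
    """Draw binary cells uniformly at random and keep those never observed."""
    seen = {tuple(c) for c in cells}
    draws = rng.integers(0, 2, size=(n0, cells.shape[1]))
    return np.array([d for d in draws if tuple(d) not in seen])

def fit_binary_main_effects(data, n0, seed=0):
    """Main-effects loglinear fit from observed cells plus sampled empty cells."""
    rng = np.random.default_rng(seed)
    # Collapse raw observations into non-empty cells and their counts.
    cells, counts = np.unique(data, axis=0, return_counts=True)
    zeros = sample_empty_cells(cells, n0, rng)
    n_empty = 2.0 ** data.shape[1] - len(cells)   # empty cells in the full table
    X = np.vstack([cells, zeros]).astype(float)
    X = np.column_stack([np.ones(len(X)), X])     # intercept + main effects
    y = np.concatenate([counts, np.zeros(len(zeros))])
    # Assumption: each sampled empty cell stands in for n_empty / n0 empty cells.
    w = np.concatenate([np.ones(len(cells)),
                        np.full(len(zeros), n_empty / len(zeros))])
    return poisson_irwls(X, y, w)

def demo(p=20, n1=500, seed=1):
    """Simulate independent binary variables; return estimated and true main effects."""
    rng = np.random.default_rng(seed)
    probs = rng.uniform(0.1, 0.4, size=p)
    data = (rng.random((n1, p)) < probs).astype(int)
    beta = fit_binary_main_effects(data, n0=10 * n1, seed=seed)
    # Under independence, the main effect of variable j is logit(probs[j]).
    return beta[1:], np.log(probs / (1 - probs))
```

Calling `est, truth = demo()` returns the estimated main effects next to their simulated true values for a table with $2^{20}$ cells, of which only a few thousand are ever touched. Interaction terms could be added as extra columns of the design matrix in the same way.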

Paper Structure (14 sections, 32 equations, 5 figures)

Figures (5)

  • Figure 1: Generating data for $p=13$ categorical variables with 3 levels each from a two-way model. Comparison between sampled data conditional likelihood (GLM0) and Poisson likelihood on sampled data (GLM) in terms of bias of the estimated interaction effects $\boldsymbol{\lambda}$ across $10$ replications of each setting. The smaller the number of sampled zero cells ($n_0$) compared to the total counts ($n_1$), the higher the bias when the sampling strategy is not accounted for. A small number of zeros is sufficient for standard GLM to work well on moderately-sized datasets.
  • Figure 2: Generating data for $p=13$ categorical variables with 3 levels each from a two-way model with a sparse banded structure and sampling varying percentages of zeros. Model selection via a BIC stepwise procedure based either on the sampled data conditional likelihood (GLM0) or on the Poisson likelihood on sampled data (GLM) is evaluated according to (a) average BIC and (b) $F_1$-score of the optimal model, across $10$ replicates for each setting: for a number of sampled zeros ($n_0$) that is only an order of magnitude higher than the total counts ($n_1$), GLM performs as well as the correctly specified sampled conditional likelihood GLM0.
  • Figure 3: Generating data for $p=20$ binary variables from a two-way model with a sparse banded structure and sampling a number of zeros ($n_0$) that is $20$ times larger than the total counts ($n_1$). Model selection is conducted via Poisson penalized regression with AIC- or BIC-based selection among all two-way interactions (lasso), AIC- or BIC-stepwise graph search (stepwise), or stochastic graph search using pseudo-likelihood (pseudo). Evaluation in terms of (a) $F_1$-score, (b) sensitivity and (c) specificity of the optimal model, across $10$ replicates for each setting, shows that, among the stepwise methods, AIC selects denser models than BIC, while neighbourhood selection via pseudo-likelihood accurately detects a sparse model.
  • Figure 4: Optimal network on survey data, found by AIC-based stepwise graph search using Poisson likelihood on $n_1=15014$ counts and $n_0=10n_1$ randomly sampled empty cells (GLM method).
  • Figure 5: (a) AIC of a model with main effects and edges that were sequentially added (+) or removed (-) during the stepwise graph search on survey data. (b)-(c) Estimated two-way effect for two of the interactions in the final log-linear graphical model: those who think that taxes are too high tend to think that government should spend less on health and education. Those who do not answer the questions on government spending, probably due to lack of knowledge, are more likely to say that taxes are high.
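
The $F_1$-score, sensitivity and specificity used in Figures 2 and 3 compare the edges (two-way interactions) of a selected model with those of the generating model. A minimal sketch of how such scores can be computed is given below, assuming edges are unordered pairs of variable indices; the function name `edge_recovery_scores` and the toy graphs are hypothetical and serve only as illustration.

```python
def edge_recovery_scores(true_edges, est_edges, p):
    """F1, sensitivity and specificity of an estimated edge set over p variables."""
    true_set = {frozenset(e) for e in true_edges}
    est_set = {frozenset(e) for e in est_edges}
    n_pairs = p * (p - 1) // 2            # all possible two-way interactions
    tp = len(true_set & est_set)          # edges correctly selected
    fp = len(est_set - true_set)          # edges wrongly selected
    fn = len(true_set - est_set)          # edges missed
    tn = n_pairs - tp - fp - fn           # non-edges correctly left out
    return {
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Toy example: a banded graph on 5 variables, here taken to mean that
# consecutive variables are linked, and an estimate that misses one true
# edge while adding one false edge.
banded = [(0, 1), (1, 2), (2, 3), (3, 4)]
estimate = [(0, 1), (1, 2), (2, 3), (0, 4)]
scores = edge_recovery_scores(banded, estimate, p=5)
```

In this toy case three of the four true edges are recovered and one of the six non-edges is wrongly included, so a denser selected model would raise sensitivity at the cost of specificity.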