Archetypal Analysis for Binary Data

A. Emilie J. Wedenborg, Morten Mørup

TL;DR

This paper addresses archetypal analysis (AA) for binary data by introducing two optimization frameworks: (i) a likelihood-based AA using a second-order expansion of the Bernoulli likelihood, with sequential minimal optimization (SMO) for the $S$-update and an active-set (AS) approach for the $C$-update, and (ii) a Bernoulli-likelihood extension of PCHA (Bernoulli-PCHA). The authors derive efficient closed-form updates and gradients tailored to Bernoulli data, demonstrate faster convergence and comparable reconstruction quality versus the existing multiplicative updates on synthetic and real binary datasets, and show that the framework extends to other distributions. They provide a general reconstruction model $R = P C S$ with $P = X + \varepsilon - 2 X \varepsilon$ to accommodate binary data, and evaluate model stability via Normalized Mutual Information (NMI) across multiple runs. The work offers practical, scalable AA for binary data and lays groundwork for applying likelihood-tailored AA to a range of distributions, with a noted trade-off between the speed of SMO-AS and the scalability of the active-set $C$-update.
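To make the reconstruction model concrete, the following is a minimal NumPy sketch of $R = P C S$ with $P = X + \varepsilon - 2 X \varepsilon$ and the Bernoulli negative log-likelihood. The matrix orientation (features by observations), the function names, the value of $\varepsilon$, and the random initialization are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def smooth_binary(X, eps=1e-3):
    # P = X + eps - 2*X*eps maps 0 -> eps and 1 -> 1 - eps,
    # so every entry of P lies strictly inside (0, 1).
    return X + eps - 2.0 * X * eps

def random_simplex_columns(rows, cols, rng):
    # Assumed initialization: each column is drawn from a flat Dirichlet,
    # so it is non-negative and sums to one.
    return rng.dirichlet(np.ones(rows), size=cols).T

def reconstruct(X, C, S, eps=1e-3):
    # R = P C S: archetypes P C are convex combinations of the smoothed data,
    # and each observation is a convex combination of the archetypes.
    return smooth_binary(X, eps) @ C @ S

def bernoulli_nll(X, R):
    # Negative Bernoulli log-likelihood of binary X under probabilities R.
    return -np.sum(X * np.log(R) + (1.0 - X) * np.log1p(-R))

rng = np.random.default_rng(0)
M, N, K = 6, 20, 3                      # features, observations, archetypes (assumed sizes)
X = (rng.random((M, N)) < 0.4).astype(float)
C = random_simplex_columns(N, K, rng)   # N x K, columns on the simplex
S = random_simplex_columns(K, N, rng)   # K x N, columns on the simplex
R = reconstruct(X, C, S)
loss = bernoulli_nll(X, R)
```

Because both $C$ and $S$ have columns on the simplex, every entry of $R$ stays within $[\varepsilon, 1 - \varepsilon]$, which keeps the logarithms in the likelihood finite.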

Abstract

Archetypal analysis (AA) is a matrix decomposition method that identifies distinct patterns using convex combinations of the data points denoted archetypes with each data point in turn reconstructed as convex combinations of the archetypes. AA thereby forms a polytope representing trade-offs of the distinct aspects in the data. Most existing methods for AA are designed for continuous data and do not exploit the structure of the data distribution. In this paper, we propose two new optimization frameworks for archetypal analysis for binary data. i) A second order approximation of the AA likelihood based on the Bernoulli distribution with efficient closed-form updates using an active set procedure for learning the convex combinations defining the archetypes, and a sequential minimal optimization strategy for learning the observation specific reconstructions. ii) A Bernoulli likelihood based version of the principal convex hull analysis (PCHA) algorithm originally developed for least squares optimization. We compare these approaches with the only existing binary AA procedure relying on multiplicative updates and demonstrate their superiority on both synthetic and real binary data. Notably, the proposed optimization frameworks for AA can easily be extended to other data distributions providing generic efficient optimization frameworks for AA based on tailored likelihood functions reflecting the underlying data distribution.
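The SMO idea mentioned in the abstract can be illustrated on a generic quadratic objective over the simplex. The sketch below is not the paper's update: it assumes an objective of the form $\tfrac{1}{2} s^\top Q s - q^\top s$ (the names `Q`, `q`, and `smo_simplex_qp` are hypothetical), whereas the paper applies SMO to a second-order expansion of the Bernoulli likelihood. The shared principle is that mass is moved between two coordinates at a time, so the sum-to-one constraint holds automatically and each step has a closed form.

```python
import numpy as np

def smo_simplex_qp(Q, q, s0, sweeps=200, tol=1e-12):
    """Minimize 0.5*s'Qs - q's over the simplex by pairwise (SMO-style) updates."""
    s = s0.astype(float).copy()
    K = s.size
    g = Q @ s - q                       # gradient at the current point
    for _ in range(sweeps):
        moved = 0.0
        for i in range(K):
            for j in range(i + 1, K):
                # Curvature along the direction e_i - e_j.
                curv = Q[i, i] + Q[j, j] - 2.0 * Q[i, j]
                if curv <= tol:
                    continue            # simplification: skip flat directions
                # Exact line search, clipped so that s_i and s_j stay >= 0.
                t = -(g[i] - g[j]) / curv
                t = min(max(t, -s[i]), s[j])
                if t != 0.0:
                    s[i] += t
                    s[j] -= t
                    g += t * (Q[:, i] - Q[:, j])
                    moved = max(moved, abs(t))
        if moved < 1e-10:               # no pair moved noticeably in this sweep
            break
    return s

# Toy least-squares problem: recover simplex weights s_true from x = A s_true.
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 4))
s_true = rng.dirichlet(np.ones(4))
x = A @ s_true
s_hat = smo_simplex_qp(A.T @ A, A.T @ x, np.full(4, 0.25))
```

Each pairwise step only touches two entries of `s` and two columns of `Q`, which is what makes this style of update cheap per iteration.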

Paper Structure

This paper contains 9 sections, 19 equations, 3 figures.

Figures (3)

  • Figure 1: Top panel: Results of the models on synthetic Gaussian data. Bottom panel: Results of the models on synthetic Bernoulli data. The left column shows the loss convergence properties. The middle column shows the loss for different numbers of archetypes. The right column displays the NMI, which in this case is used as a measure of the stability of the solution.
  • Figure 2: Comparison of SMO and FNNLS (Bro and De Jong, 1997). For every value of $K$ (the number of archetypes), the SMO updates converge to the optimal solution within $K^2$ iterations. The optimal solutions for the different values of $K$, identified by FNNLS, are marked by the dotted black lines.
  • Figure 3: Performance visualizations on the SIDER data set. From (a) and (b) we observe that while SMO-AS and B-PCHA converge within fewer iterations, all three models converge to approximately the same loss and with comparable stability (NMI), although the solutions found by multiplicative updates are less stable, especially for higher numbers of archetypes (c). From (d) it can be seen that the SMO-AS model clearly outperforms the other models in terms of runtime.
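The captions use NMI between repeated runs as a stability measure. The page does not give the formula, so the sketch below assumes one common definition for soft assignment matrices: the joint distribution over archetype pairs is estimated as $\frac{1}{N} S^{(1)} {S^{(2)}}^\top$, and the mutual information is normalized by the mean of the two self-informations. The function names are hypothetical.

```python
import numpy as np

def mutual_information(S1, S2):
    # S1 is K1 x N and S2 is K2 x N, each with columns on the simplex.
    N = S1.shape[1]
    P = S1 @ S2.T / N                   # assumed joint distribution over (k, k')
    p1 = P.sum(axis=1, keepdims=True)   # marginal over archetypes of run 1
    p2 = P.sum(axis=0, keepdims=True)   # marginal over archetypes of run 2
    mask = P > 0                        # skip zero cells to avoid log(0)
    ratio = P[mask] / (p1 @ p2)[mask]
    return float(np.sum(P[mask] * np.log(ratio)))

def nmi(S1, S2):
    # Normalize by the average self-information of the two runs.
    denom = mutual_information(S1, S1) + mutual_information(S2, S2)
    return 2.0 * mutual_information(S1, S2) / denom

# Two hard assignments that differ only by a relabeling of the archetypes.
K, N = 3, 12
labels = np.arange(N) % K
S_a = np.eye(K)[:, labels]              # one-hot columns, K x N
S_b = S_a[[2, 0, 1], :]                 # same partition, permuted labels
score_same = nmi(S_a, S_b)

# An uninformative run: every observation spread evenly over the archetypes.
S_flat = np.full((K, N), 1.0 / K)
score_flat = nmi(S_a, S_flat)
```

Under this definition the score is invariant to a relabeling of the archetypes, which matters because the ordering of archetypes is arbitrary across runs.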