A Survey on Archetypal Analysis

Aleix Alcacer, Irene Epifanio, Sebastian Mair, Morten Mørup

TL;DR

Archetypal Analysis (AA) provides an interpretable, geometry-based framework to represent each observation as a convex combination of a small set of extreme archetypes that lie on the data's convex hull. The paper surveys the mathematical formulation, optimization approaches, extensions (e.g., kernel AA, archetypoids, BiAA), robustness and missing-data strategies, and a wide array of applications across life sciences, physics/chemistry, climate science, computer science, and social sciences. It also discusses practical concerns such as initialization, model-order selection, scalability, and reproducibility, and outlines future directions including non-linear extensions, temporal dynamics, and automated selection of the number of archetypes. Overall, the work positions AA as a versatile, interpretable tool that complements clustering and matrix factorization, while identifying key limitations and open problems for ongoing research.

Abstract

Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure for extracting distinct aspects, so-called archetypes, from observations, with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data and enabling wide applications across the sciences. However, AA also faces challenges, particularly as the associated optimization problem is non-convex. This is the first survey that provides researchers and data mining practitioners with an overview of the methodologies and opportunities that AA offers, surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data with AA and its limitations. The survey concludes by explaining crucial future research directions concerning AA.
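The convex-combination structure described above can be illustrated with a small numerical sketch. The code below is not the survey's algorithm; it is a minimal alternating projected-gradient scheme, assuming the standard AA objective of minimizing $\lVert \mathbf{X} - \mathbf{A}\mathbf{B}\mathbf{X} \rVert_F^2$ with the rows of $\mathbf{A}$ and $\mathbf{B}$ constrained to the probability simplex. The function names (`project_rows_to_simplex`, `archetypal_analysis`), the Dirichlet initialization, and the step-size rule are illustrative choices, not taken from the paper.

```python
import numpy as np


def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex."""
    n, k = V.shape
    U = np.sort(V, axis=1)[:, ::-1]            # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, k + 1)
    cond = U - css / idx > 0                   # true for the leading "active" entries
    rho = k - np.argmax(cond[:, ::-1], axis=1) # number of active entries per row
    theta = css[np.arange(n), rho - 1] / rho   # per-row shift
    return np.maximum(V - theta[:, None], 0.0)


def archetypal_analysis(X, K, n_iter=200, seed=0):
    """Sketch of AA: X (N x M) is approximated by A @ B @ X.

    A (N x K): mixing weights of each observation over the archetypes.
    B (K x N): weights expressing each archetype as a mixture of observations.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    A = rng.dirichlet(np.ones(K), size=N)      # rows are nonnegative and sum to one
    B = rng.dirichlet(np.ones(N), size=K)
    sx = np.linalg.norm(X, 2) ** 2             # squared spectral norm of X
    losses = []
    for _ in range(n_iter):
        # Update A with B fixed; step size 1/L from the gradient's Lipschitz constant.
        Z = B @ X
        R = A @ Z - X
        L_A = 2.0 * np.linalg.norm(Z, 2) ** 2 + 1e-12
        A = project_rows_to_simplex(A - (2.0 * R @ Z.T) / L_A)
        # Update B with A fixed.
        R = A @ (B @ X) - X
        L_B = 2.0 * np.linalg.norm(A, 2) ** 2 * sx + 1e-12
        B = project_rows_to_simplex(B - (2.0 * A.T @ R @ X.T) / L_B)
        losses.append(float(np.sum((A @ B @ X - X) ** 2)))
    return A, B, B @ X, losses


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(60, 2))
    A, B, Z, losses = archetypal_analysis(X, K=3)
    print("archetypes:\n", Z)
    print("first and last loss:", losses[0], losses[-1])
```

Because the problem is non-convex, as the abstract notes, a scheme like this only reaches a local optimum and the result depends on the initialization; in practice several random restarts would be compared.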

Paper Structure

This paper contains 27 sections, 3 theorems, 18 equations, 14 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{X} \subset \mathbb{R}^M$ be a discrete dataset, $\operatorname{conv}(\mathcal{X})$ be its convex hull and $\bm{\mu} \in \mathbb{R}^M$ be the mean of $\mathcal{X}$. Furthermore, let $K \in \mathbb{N}$ be the number of archetypes and $\partial\mathcal{X}$ be the boundary of $\mathcal{X}$

Figures (14)

  • Figure 1: AA computed on the subset of the MNIST handwritten digit dataset showing solely the digit 9 with three archetypes. The left part depicts the mixing weights of the nines whereas the right part depicts the actual handwritten digits.
  • Figure 2: Citation overview of the original AA paper by Cutler and Breiman (1994) in Google Scholar and SCOPUS until 5th November 2025.
  • Figure 3: An example of AA with $K=3$ archetypes on two-dimensional toy data. After initializing the archetypes (purple squares), the optimization procedure pushes the archetypes (green squares) to lie on the boundary of the convex hull of data (see Theorem 1). The overall objective is to minimize the sum of projection errors (i.e., the sum of all dashed lines).
  • Figure 4: An example of AA in two dimensions for various numbers ($K=2,3,4$) of archetypes $\{\mathbf{a}_1,\ldots,\mathbf{a}_K\}$. The archetypes are always located on the boundary of the convex hull of data.
  • Figure 5: AA with $K=3$ archetypes compared to a non-negative matrix factorization (NMF) with $K=2$ components, a principal component analysis (PCA) with $K=2$ components (first is solid, second dashed), $k$-means clustering with $K=3$ clusters, and $k$-maxoids clustering with $K=3$ clusters.
  • ...and 9 more figures

Theorems & Definitions (3)

  • Theorem 1: cutler1994archetypal
  • Lemma 1: morup2012archetypal
  • Theorem 2: morup2012archetypal