
Learning Mixtures of Arbitrary Distributions over Large Discrete Domains

Abstract

We give an algorithm for learning a mixture of {\em unstructured} distributions. This problem arises in various unsupervised learning scenarios, for example in learning {\em topic models} from a corpus of documents spanning several topics. We show how to learn the constituents of a mixture of $k$ arbitrary distributions over a large discrete domain $[n] = \{1, 2, \ldots, n\}$ and the mixture weights, using $O(n \,\mathrm{polylog}\, n)$ samples. (In the topic-model learning setting, the mixture constituents correspond to the topic distributions.) This task is information-theoretically impossible for $k > 1$ under the usual sampling process from a mixture distribution. However, there are situations (such as the above-mentioned topic model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the {\em "sampling aperture"}, is a crucial parameter of the problem. We obtain the {\em first} bounds for this mixture-learning problem {\em without imposing any assumptions on the mixture constituents.} We show that efficient learning is possible exactly at the information-theoretically least-possible aperture of $2k - 1$. Thus, we achieve near-optimal dependence on $n$ and optimal aperture. While the sample size required by our algorithm depends exponentially on $k$, we prove that such a dependence is {\em unavoidable} when one considers general mixtures. A sequence of tools contributes to the algorithm, including concentration results for random matrices, dimension reduction, moment estimation, and sensitivity analysis.
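To make the multi-observation sampling model concrete, the following is a minimal Python sketch (not from the paper; all names are illustrative) of drawing an aperture-$t$ sample: a constituent is first chosen according to the mixture weights, and then $t$ i.i.d. observations are drawn from that constituent's distribution over $[n]$.

\begin{verbatim}
import numpy as np

def sample_with_aperture(weights, constituents, aperture, num_samples, rng=None):
    """Simulate the multi-observation sampling process described above.

    Each sample point consists of `aperture` i.i.d. observations drawn
    from a single mixture constituent, itself chosen according to the
    mixture weights. (Illustrative sketch; names are not from the paper.)
    """
    if rng is None:
        rng = np.random.default_rng()
    weights = np.asarray(weights)            # k mixture weights, summing to 1
    constituents = np.asarray(constituents)  # shape (k, n): row i is a distribution over [n]
    k, n = constituents.shape
    samples = []
    for _ in range(num_samples):
        i = rng.choice(k, p=weights)                           # pick a constituent
        obs = rng.choice(n, size=aperture, p=constituents[i])  # aperture observations from it
        samples.append(obs)
    return np.array(samples)  # shape (num_samples, aperture)

# Example: k = 2 constituents over a domain of size n = 5,
# sampled at the aperture 2k - 1 = 3 from the abstract.
w = [0.6, 0.4]
P = [[0.5, 0.2, 0.1, 0.1, 0.1],
     [0.1, 0.1, 0.1, 0.2, 0.5]]
print(sample_with_aperture(w, P, aperture=3, num_samples=4))
\end{verbatim}

Under the usual sampling process (aperture $1$), each sample reveals only a draw from the weighted average of the constituents, which is why the aperture is the crucial parameter here.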