Dissimilar Batch Decompositions of Random Datasets
Ghurumuruhan Ganesan
TL;DR
The paper studies dissimilar batch decompositions of large, randomly drawn datasets, in which each batch is kept internally diverse. It builds a probabilistic model where each data point has a continuous part and a categorical part with possible corruption, and defines a similarity relation between data points. Using martingale difference bounds, concentration inequalities, and the Lovász Local Lemma, it derives high-probability bounds on the minimum size of a decomposition that satisfies a per-batch similarity constraint and shows that relaxing the constraint reduces this minimum size. It also analyzes the maximum size of a data subset with a given similarity and provides variance bounds. Together, these results outline tradeoffs that can guide batch construction in learning pipelines.
Abstract
For better learning, large datasets are often split into small batches and fed sequentially to the predictive model. In this paper, we study such batch decompositions from a probabilistic perspective. We assume that data points (possibly corrupted) are drawn independently from a given space and define a concept of similarity between two data points. We then consider decompositions that restrict the amount of similarity within each batch and obtain high-probability bounds for the minimum size of such a decomposition. We demonstrate an inherent tradeoff between relaxing the similarity constraint and the overall size of the decomposition, and we also use martingale methods to obtain bounds for the maximum size of data subsets with a given similarity.
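As a rough illustration of the setting described above, the following Python sketch draws random data points and splits them into batches under a per-batch similarity limit. Everything specific here is an assumption made for illustration and is not taken from the paper: the continuous part is uniform on [0, 1], the categorical part is a uniform label that is redrawn with probability `corruption_prob`, two points count as similar when their labels match and their continuous parts lie within `radius`, and the limit `max_similar` caps how many similar points any point may share a batch with. The first-fit greedy rule is a simple heuristic, not the construction analyzed in the paper, and it does not attain the minimum size in general.

```python
import random


def sample_point(rng, num_categories, corruption_prob):
    """Draw one (continuous, categorical) data point (hypothetical model)."""
    x = rng.random()                      # continuous part, uniform on [0, 1)
    label = rng.randrange(num_categories)  # categorical part
    if rng.random() < corruption_prob:
        # Corruption: replace the label with a fresh uniform draw.
        label = rng.randrange(num_categories)
    return (x, label)


def similar(p, q, radius):
    """Assumed similarity: same label and continuous parts within radius."""
    return p[1] == q[1] and abs(p[0] - q[0]) <= radius


def greedy_decomposition(points, radius, max_similar=0):
    """First-fit heuristic: put each point in the first batch where every
    point (including the new one) keeps at most max_similar similar peers."""
    batches = []  # each batch is a list of [point, similar_count] entries
    for p in points:
        placed = False
        for batch in batches:
            neighbours = [e for e in batch if similar(p, e[0], radius)]
            fits_new = len(neighbours) <= max_similar
            fits_old = all(e[1] < max_similar for e in neighbours)
            if fits_new and fits_old:
                for e in neighbours:
                    e[1] += 1
                batch.append([p, len(neighbours)])
                placed = True
                break
        if not placed:
            batches.append([[p, 0]])
    return [[e[0] for e in batch] for batch in batches]


def max_within_batch_similarity(batches, radius):
    """Largest number of similar peers any point has inside its own batch."""
    worst = 0
    for batch in batches:
        for i, p in enumerate(batch):
            count = sum(
                1 for j, q in enumerate(batch)
                if i != j and similar(p, q, radius)
            )
            worst = max(worst, count)
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    data = [sample_point(rng, num_categories=5, corruption_prob=0.1)
            for _ in range(500)]
    for limit in (0, 1, 2):
        batches = greedy_decomposition(data, radius=0.05, max_similar=limit)
        print(limit, len(batches), max_within_batch_similarity(batches, 0.05))
```

Running the script prints, for each similarity limit, the number of batches the heuristic used and the largest within-batch similarity it produced, which gives an empirical feel for the tradeoff between the similarity constraint and the size of the decomposition.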
