Learning Tree-Structured Composition of Data Augmentation
Dongyue Li, Kailai Chen, Predrag Radivojac, Hongyang R. Zhang
TL;DR
The paper tackles the combinatorial challenge of learning effective data augmentation by introducing a binary tree-structured composition, which models augmentation as a depth-d tree with 2^d−1 nodes, each node holding one transformation. A greedy, top-down search constructs the tree in O(2^d k) time, compared with the O(k^d) worst case of searching over all length-d sequences of k transformations, and a density-matching approach evaluates candidates without retraining. The method extends to heterogeneous subpopulations by learning one tree per group and combining the trees into a forest, with weights learned via bilevel optimization, which enables group-aware augmentation. Across graph and image benchmarks, including a newly collected AlphaFold-based protein graph dataset, the method reduces computation cost by 43% relative to existing search methods and improves performance by up to 4.3%, while providing interpretable importance scores for each transformation. These results suggest that augmentation search can be made practical and scalable, with interpretable structures and principled handling of group shifts in real-world data.
Abstract
Data augmentation is widely used for training a neural network given little labeled data. A common practice in augmentation training is to apply a composition of multiple transformations sequentially to the data. Existing augmentation methods such as RandAugment randomly sample from a list of pre-selected transformations, while methods such as AutoAugment apply advanced search to optimize over an augmentation set of size $k^d$, which is the number of transformation sequences of length $d$, given a list of $k$ transformations. In this paper, we design efficient algorithms whose running time is provably much lower than the worst-case complexity of $O(k^d)$. We propose a new algorithm to search for a binary tree-structured composition of $k$ transformations, where each tree node corresponds to one transformation. The binary tree generalizes sequential augmentations, such as the SimCLR augmentation scheme for contrastive learning. Using a top-down, recursive search procedure, our algorithm achieves a runtime complexity of $O(2^d k)$, which grows much more slowly than $O(k^d)$ as $k$ increases above $2$. We apply our algorithm to tackle data distributions with heterogeneous subpopulations by searching for one tree in each subpopulation and then learning a weighted combination, resulting in a forest of trees. We validate our proposed algorithms on numerous graph and image datasets, including a multi-label graph classification dataset we collected. The dataset exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. We show that our approach can reduce the computation cost by 43% over existing search methods while improving performance by 4.3%. The tree structures can be used to interpret the relative importance of each transformation, such as identifying the important transformations on small vs. large graphs.
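To make the $O(2^d k)$ versus $O(k^d)$ comparison concrete, the following is a minimal sketch of a greedy, top-down tree search that counts how many candidate evaluations it performs. It is an illustration under stated assumptions, not the authors' implementation: the names `Node`, `greedy_tree_search`, `toy_score`, and `count_nodes` are hypothetical, the representation of a path as a tuple of (transformation, branch) pairs is an assumption, and `toy_score` is an arbitrary deterministic stand-in for whatever criterion the method uses to rank candidates.

```python
from typing import Callable, Optional, Sequence, Tuple

# A path records the choices made from the root down to the current position.
# Representing it as (transformation, branch) pairs is an assumption of this sketch.
Path = Tuple[Tuple[str, str], ...]


class Node:
    """One tree node holding a single transformation (hypothetical structure)."""

    def __init__(self, transform: str) -> None:
        self.transform = transform
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def greedy_tree_search(
    transforms: Sequence[str],
    depth: int,
    score: Callable[[Path, str], float],
) -> Tuple[Optional[Node], int]:
    """Fill a full binary tree top-down, picking the best-scoring
    transformation at each node. Returns the root and the number of
    score evaluations performed."""
    evaluations = 0

    def build(path: Path, level: int) -> Optional[Node]:
        nonlocal evaluations
        if level == depth:
            return None
        best, best_score = transforms[0], float("-inf")
        for candidate in transforms:  # k evaluations per node
            evaluations += 1
            s = score(path, candidate)
            if s > best_score:
                best, best_score = candidate, s
        node = Node(best)
        node.left = build(path + ((best, "left"),), level + 1)
        node.right = build(path + ((best, "right"),), level + 1)
        return node

    return build((), 0), evaluations


def toy_score(path: Path, candidate: str) -> float:
    """Arbitrary deterministic stand-in for a real evaluation criterion."""
    index = int(candidate[1:])
    return float((index * 7 + len(path) * 3) % 16)


def count_nodes(node: Optional[Node]) -> int:
    """Count the nodes of the constructed tree."""
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)


k, d = 16, 3
transforms = [f"t{i}" for i in range(k)]
root, evaluations = greedy_tree_search(transforms, d, toy_score)
print(count_nodes(root), evaluations, k ** d)
```

With $k = 16$ and $d = 3$, the sketch visits $2^d - 1 = 7$ nodes and scores $k$ candidates at each, for $(2^d - 1)k = 112$ evaluations, whereas enumerating every length-$d$ sequence would require $k^d = 4096$. The gap widens as $k$ grows, since the tree search is linear in $k$ for fixed $d$.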
