Improving the performance of weak supervision searches using data augmentation
Zong-En Chen, Cheng-Wei Chiang, Feng-Yang Hsieh
TL;DR
This work addresses the data-efficiency challenge in weakly supervised collider searches by introducing physics-inspired data augmentation to the CWoLa (Classification Without Labels) framework. By applying $p_{\text{T}}$ smearing, jet rotation, and their combination to jet images, the authors lower the learning threshold, i.e. the signal significance in the training data above which the network starts to learn, from about $6\sigma$ to about $3\sigma$. This enables more robust discrimination between signal and background with substantially fewer signal events. Using a Hidden Valley (HV) benchmark mediated by a $Z'$, the study reports that EN normalization effectively mitigates sculpting and that the combined augmentation provides the strongest gains, even under moderate systematic uncertainties. Overall, physics-informed data augmentation emerges as a practical, data-efficient tool for weakly supervised learning in collider searches, with potential applicability beyond the specific HV scenario.
Abstract
Weak supervision combines the advantages of training on real data with the ability to exploit signal properties. However, training a neural network using weak supervision often requires an excessive amount of signal data, which severely limits its practical applicability. In this study, we propose addressing this limitation through data augmentation, increasing the training data's size and diversity. Specifically, we focus on physics-inspired data augmentation methods, such as $p_{\text{T}}$ smearing and jet rotation. Our results demonstrate that data augmentation can significantly enhance the performance of weak supervision, enabling neural networks to learn efficiently from substantially less data.
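To make the two augmentations named above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each jet is given as NumPy arrays of constituent $p_{\text{T}}$, $\eta$, and $\phi$; that smearing is a Gaussian multiplicative factor with a hypothetical relative width `rel_sigma`; and that rotation is a rigid rotation in the $\eta$-$\phi$ plane about the $p_{\text{T}}$-weighted centroid. The function names and parameter values are illustrative only.

```python
import numpy as np


def smear_pt(pt, rel_sigma=0.1, rng=None):
    """Scale each constituent pT by a Gaussian factor centred at 1.

    rel_sigma is a hypothetical relative resolution, not a value from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    factor = rng.normal(loc=1.0, scale=rel_sigma, size=pt.shape)
    # Clip so that a large downward fluctuation cannot yield a negative pT.
    return np.clip(pt * factor, 0.0, None)


def rotate_jet(eta, phi, pt, angle=None, rng=None):
    """Rotate constituents in the eta-phi plane about the pT-weighted centroid.

    Assumes the phi values do not straddle the +/-pi boundary.
    """
    rng = np.random.default_rng() if rng is None else rng
    if angle is None:
        angle = rng.uniform(-np.pi, np.pi)
    total = pt.sum()
    eta_c = np.sum(pt * eta) / total
    phi_c = np.sum(pt * phi) / total
    d_eta, d_phi = eta - eta_c, phi - phi_c
    c, s = np.cos(angle), np.sin(angle)
    new_eta = eta_c + c * d_eta - s * d_phi
    new_phi = phi_c + s * d_eta + c * d_phi
    return new_eta, new_phi


def augment(pt, eta, phi, n_copies=5, rel_sigma=0.1, seed=0):
    """Return n_copies augmented versions of one jet (smearing, then rotation)."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        new_pt = smear_pt(pt, rel_sigma, rng)
        new_eta, new_phi = rotate_jet(eta, phi, new_pt, rng=rng)
        copies.append((new_pt, new_eta, new_phi))
    return copies
```

In this sketch the augmented copies would be pixelated into jet images and added to the training set with the same mixed-sample label as the original event, so the training data grow in size and diversity without any additional collisions being recorded.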
