Improving the performance of weak supervision searches using data augmentation

Zong-En Chen, Cheng-Wei Chiang, Feng-Yang Hsieh

TL;DR

This work addresses the data-efficiency challenge in weakly supervised collider searches by introducing physics-inspired data augmentation to the CWoLa framework. By applying $p_{\text{T}}$ smearing, jet rotation, and their combination to jet images, the authors lower the learning threshold, the signal significance at which the network begins to learn, from about $6\sigma$ to about $3\sigma$, enabling more robust discrimination between signal and background with substantially fewer signal events. The study uses a Hidden Valley (HV) benchmark with a $Z'$ mediator to demonstrate that EN normalization effectively mitigates sculpting and that the combined augmentation provides the strongest gains, even under moderate systematic uncertainties. Overall, physics-informed data augmentation emerges as a practical, data-efficient tool for enhancing weakly supervised learning in collider searches, with potential applicability beyond the specific HV scenario.

Abstract

Weak supervision combines the advantages of training on real data with the ability to exploit signal properties. However, training a neural network using weak supervision often requires an excessive amount of signal data, which severely limits its practical applicability. In this study, we propose addressing this limitation through data augmentation, increasing the training data's size and diversity. Specifically, we focus on physics-inspired data augmentation methods, such as $p_{\text{T}}$ smearing and jet rotation. Our results demonstrate that data augmentation can significantly enhance the performance of weak supervision, enabling neural networks to learn efficiently from substantially less data.
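To make the two augmentation methods named above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a jet is given as arrays of constituent $p_{\text{T}}$, $\eta$, and $\phi$, with $\eta$ and $\phi$ already measured relative to the jet axis. The function names (`smear_pt`, `rotate_jet`, `augment`), the Gaussian smearing with a fixed relative width `rel_sigma`, and the uniform rotation angle in $[-\pi, \pi)$ are all illustrative assumptions; the resolution function and angle range used in the paper may differ.

```python
import numpy as np


def smear_pt(pt, rng, rel_sigma=0.1):
    """Resample each constituent pT from a Gaussian centred on its value.

    rel_sigma is a hypothetical relative resolution, not a value from the paper.
    """
    smeared = rng.normal(loc=pt, scale=rel_sigma * pt)
    # Negative pT is unphysical, so clip at zero.
    return np.clip(smeared, 0.0, None)


def rotate_jet(eta, phi, rng):
    """Rotate constituents in the (eta, phi) plane about the jet axis.

    Assumes eta and phi are given relative to the jet axis (origin = axis).
    """
    theta = rng.uniform(-np.pi, np.pi)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    eta_rot = cos_t * eta - sin_t * phi
    phi_rot = sin_t * eta + cos_t * phi
    return eta_rot, phi_rot


def augment(pt, eta, phi, n_copies=5, seed=0):
    """Return n_copies augmented versions of one jet (smearing + rotation)."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        new_pt = smear_pt(pt, rng)
        new_eta, new_phi = rotate_jet(eta, phi, rng)
        copies.append((new_pt, new_eta, new_phi))
    return copies


if __name__ == "__main__":
    # Toy jet with three constituents (made-up numbers for illustration).
    pt = np.array([50.0, 20.0, 5.0])
    eta = np.array([0.10, -0.20, 0.30])
    phi = np.array([0.00, 0.40, -0.10])
    for new_pt, new_eta, new_phi in augment(pt, eta, phi, n_copies=2):
        print(new_pt, new_eta, new_phi)
```

Each augmented copy keeps the label of the original event, so the training sample grows in size and diversity without requiring additional signal events. In a jet-image pipeline, the augmented constituents would then be pixelated into images in the same way as the original jets.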

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures.

Figures (8)

  • Figure 1: The architecture of the neural network and model hyperparameters.
  • Figure 2: The invariant mass $m_{jj}$ histogram and the NN cut passing efficiency $\varepsilon$ as functions of $m_{jj}$, with different sideband efficiencies $\varepsilon_\text{SB}$.
  • Figure 3: The sensitivities before and after the NN selection. The gray dotted line represents the sensitivity before NN selection. The error bars show the standard deviation over 10 independent trainings.
  • Figure 4: The jet images before and after different data augmentation methods.
  • Figure 5: The sensitivities before and after the NN selection. The gray dotted line represents the sensitivity before NN selection. The error bars show the standard deviation over 10 independent trainings. The label "$p_{\text{T}}$-rot" denotes the "$p_{\text{T}}$ smearing + jet rotation" augmentation method.
  • ...and 3 more figures