Increasing Information Extraction in Low-Signal Regimes via Multiple Instance Learning
Atakan Azakli, Bernd Stelzer
TL;DR
Single-instance ML often underperforms in low-signal hypothesis testing, for example when constraining Wilson coefficients of the Standard Model Effective Field Theory (SMEFT). We cast Multiple Instance Learning (MIL) in an information-theoretic framework: events are aggregated into bags to strengthen the discriminative signal, and we derive how bag-level information increases the effective Fisher information. The paper presents the theory, a practical calibration for violations of the Bartlett identity, and experiments with binary, multi-class, and parameterized networks that demonstrate MIL's resilience and Fisher-information gains. Limitations include simplified data and the i.i.d. assumption; future work includes modeling σ_ε(N_B) and designing MIL architectures that maximize set-level sufficiency.
Abstract
In this work, we introduce a new information-theoretic perspective on Multiple Instance Learning (MIL) for parameter estimation with i.i.d. data, and show that MIL can outperform single-instance learners in low-signal regimes. Prior work [Nachman and Thaler, 2021] argued that single-instance methods are often sufficient, but this conclusion presumes enough single-instance signal to train near-optimal classifiers. We demonstrate that even state-of-the-art single-instance models can fail to reach optimal classifier performance in challenging low-signal regimes, whereas MIL can mitigate this sub-optimality. As a concrete application, we constrain Wilson coefficients of the Standard Model Effective Field Theory (SMEFT) using kinematic information from subatomic particle collision events at the Large Hadron Collider (LHC). In experiments, we observe that under specific modeling and weak signal conditions, pooling instances can increase the effective Fisher information compared to single-instance approaches.
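To make the pooling idea concrete, the following is a minimal sketch rather than the paper's setup. It assumes a toy one-dimensional Gaussian problem in place of SMEFT kinematics, uses the exact per-event log-likelihood ratio in place of a trained classifier, and picks illustrative values for `mu0`, `mu1`, `bag_size`, and `n_events`; the helper names `make_bags`, `gaussian_llr`, and `auc` are hypothetical. For i.i.d. events the log-likelihood ratio of a bag is the sum of the per-event ratios, so a bag-level score separates the two hypotheses far better than a single-event score when the per-event signal is weak.

```python
import numpy as np

def make_bags(events, bag_size):
    """Split events into non-overlapping bags of `bag_size`; leftovers are dropped."""
    n_bags = len(events) // bag_size
    return events[: n_bags * bag_size].reshape(n_bags, bag_size)

def gaussian_llr(x, mu0, mu1, sigma=1.0):
    """Per-event log p(x | mu1) - log p(x | mu0) for equal-width Gaussians."""
    return ((x - mu0) ** 2 - (x - mu1) ** 2) / (2.0 * sigma**2)

def auc(scores_h0, scores_h1):
    """Area under the ROC curve via the Mann-Whitney U statistic (assumes no ties)."""
    scores = np.concatenate([scores_h0, scores_h1])
    ranks = scores.argsort().argsort() + 1  # 1-based ranks
    n0, n1 = len(scores_h0), len(scores_h1)
    u1 = ranks[n0:].sum() - n1 * (n1 + 1) / 2
    return u1 / (n0 * n1)

# Toy values chosen for illustration only; they are not taken from the paper.
rng = np.random.default_rng(0)
mu0, mu1, bag_size, n_events = 0.0, 0.1, 100, 200_000

x_h0 = rng.normal(mu0, 1.0, n_events)  # events drawn under hypothesis 0
x_h1 = rng.normal(mu1, 1.0, n_events)  # events drawn under hypothesis 1

llr_h0 = gaussian_llr(x_h0, mu0, mu1)
llr_h1 = gaussian_llr(x_h1, mu0, mu1)

# Single-event discrimination versus bag-level discrimination (summed LLR per bag).
auc_single = auc(llr_h0, llr_h1)
auc_bag = auc(make_bags(llr_h0, bag_size).sum(axis=1),
              make_bags(llr_h1, bag_size).sum(axis=1))

print(f"single-event AUC: {auc_single:.3f}")
print(f"bag-level AUC (N_B={bag_size}): {auc_bag:.3f}")
```

In this toy, the single-event AUC sits only slightly above 0.5, while the bag-level AUC is substantially higher. Because the exact likelihood ratio is used here, the sketch only illustrates why a bag carries a stronger training signal; it does not reproduce the paper's comparison between trained single-instance and MIL models.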
