GeNeX: Genetic Network eXperts framework for addressing Validation Overfitting

Emmanuel Pintelas, Ioannis E. Livieris

Abstract

Excessive reliance on validation performance during model selection can lead to validation overfitting (VO), where models appear effective during development but fail at test time. This issue is further amplified in low-data regimes and under distribution shifts, where validation signals become unreliable. Although ensemble learning is widely used to improve robustness and generalization, most ensemble construction pipelines depend heavily on validation scores, leaving them vulnerable to VO and limiting their reliability in real-world deployment scenarios. To address this, we propose GeNeX (Genetic Network eXperts), a framework that mitigates validation overfitting at both the model generation and ensemble construction stages. In the generation phase, GeNeX employs a dual-path strategy in which gradient-based training is coupled with genetic model evolution. Offspring networks are created via crossover of trained parents, promoting structural diversity and weight-level regeneration without relying on validation feedback. This results in a candidate pool of robust, non-overfitted models. During ensemble construction, the candidate networks are clustered by prediction behavior to identify complementary model spaces. Within each cluster, multiple diverse experts are selected using criteria such as robustness and representativeness, and fused at the weight level to form compact prototype networks. The final ensemble aggregates these prototypes, with predictions optimized via Sequential Quadratic Programming for output-level synergy. To rigorously evaluate VO resilience, we introduce a VO-aware evaluation protocol that simulates realistic deployment scenarios by enforcing distributional divergence between training and test subsets.
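
To make the crossover step in the abstract concrete, the following is a minimal sketch of producing an offspring network from two trained parents without consulting any validation data. The abstract does not specify the crossover operator, so everything here is an assumption for illustration: the function name `crossover`, the representation of a network as a dict of flat weight lists, the layer-wise uniform inheritance rule, and the optional Gaussian mutation controlled by the hypothetical `mutation_std` parameter.

```python
import random


def crossover(parent_a, parent_b, mutation_std=0.0, rng=None):
    """Layer-wise uniform crossover of two parents with the same architecture.

    Each parent is a dict mapping a layer name to a flat list of weights.
    This is an illustrative operator, not necessarily the one used by GeNeX.
    """
    rng = rng or random.Random()
    if parent_a.keys() != parent_b.keys():
        raise ValueError("parents must share the same layers")

    child = {}
    for name in parent_a:
        weights_a, weights_b = parent_a[name], parent_b[name]
        if len(weights_a) != len(weights_b):
            raise ValueError(f"layer {name!r} differs in size between parents")
        # Inherit the whole layer from one parent, chosen with equal probability.
        source = weights_a if rng.random() < 0.5 else weights_b
        # Optional Gaussian mutation perturbs the inherited weights.
        child[name] = [w + rng.gauss(0.0, mutation_std) for w in source]
    return child


# Toy parents: no validation score is consulted anywhere in the procedure.
parent_a = {"conv1": [0.1, 0.2], "fc": [1.0, 1.0, 1.0]}
parent_b = {"conv1": [0.9, 0.8], "fc": [-1.0, -1.0, -1.0]}
child = crossover(parent_a, parent_b, rng=random.Random(0))
```

With `mutation_std=0.0`, every layer of the child is an exact copy of the corresponding layer of one parent, so the child recombines trained weights rather than selecting a checkpoint by validation score.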

Paper Structure

This paper contains 36 sections, 14 equations, 3 figures, 2 tables, 3 algorithms.

Figures (3)

  • Figure 1: GeNeX overview. GenE generates a diverse pool $\mathcal{M}$ without validation monitoring via short supervised training and genetic crossover/mutation, encouraging broad weight-space exploration and limiting early validation dependence. ProtoNeX clusters models in behavior (prediction) space, elects complementary experts via multi-criteria selection, and fuses them into $K$ compact prototypes. Instead of clustering data and training separate models per cluster, ProtoNeX clusters the models themselves and uses prototype fusion, promoting complementarity and behavioral diversity across the model space to enhance generalization.
  • Figure 2: Distribution-shifted train/test partitioning produced by the proposed JSD-guided clustering algorithm, applied to the Skin Cancer dataset. The method partitions a dataset into training and test subsets such that the Jensen-Shannon divergence (JSD) between their distributions is maximized. This process generates distributionally shifted splits that simulate realistic and challenging learning scenarios. Such splits are useful for reliably benchmarking a model’s robustness to validation overfitting. Visual inspection reveals distinct data characteristics across the two subsets, emphasizing the diversity of the resulting partitions.
  • Figure 3: Visualization of Train–Val Overfitting (TO) and Val–Test Overfitting (VO) gaps for two parent networks A and B and their genetically evolved child models. The study is conducted on the most challenging VO-pruned dataset, GS-DeepFake, which exhibits the highest distributional shift (JSD = 0.545). We observe that despite minimal TO and high validation performance (epochs 12–18 for A), severe VO gaps can still occur, leading to unreliable selection. After genetic crossover (vertical line), child models demonstrate improved generalization and reduced overfitting.
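
The captions for Figures 2 and 3 report distribution shift as a JSD value. As a minimal sketch of how such a value can be computed, the code below evaluates the Jensen-Shannon divergence between two discrete histograms, for example per-cluster sample counts in a training and a test subset. It does not reproduce the JSD-guided partitioning algorithm itself. The function names, the use of histograms as the distribution estimate, and the base-2 logarithm (which bounds the result in [0, 1]) are assumptions, since the captions do not state how the distributions are estimated.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p_counts, q_counts):
    """Jensen-Shannon divergence between two histograms over the same bins."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p_counts), sum(q_counts)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    # Normalize counts into probability distributions.
    p = [x / total_p for x in p_counts]
    q = [x / total_q for x in q_counts]
    # The mixture m is strictly positive wherever p or q is, so KL is finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical cluster-membership counts for a train and a test subset.
train_hist = [40, 30, 20, 10]
test_hist = [5, 10, 25, 60]
shift = jsd(train_hist, test_hist)
```

Identical histograms give a divergence of 0 and histograms with disjoint support give 1, so a larger value indicates a more strongly shifted split under this estimate.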