Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles

Zacharie Bugaud

Abstract

Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce the effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) in which correlated majority errors drive accuracy to 0% even though the best model is correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method that weights models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality. It achieves the largest gains (+0.68% on VQAv2, +0.61% on TextVQA, +2.45% on GQA, all significant) and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.
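
The contrast between flat voting and family-first aggregation can be illustrated with a minimal sketch. It is not the paper's implementation: the model names, the family labels, the use of unweighted majority voting at both levels, and the tie-breaking rule (first answer encountered wins) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def flat_vote(answers):
    """Plain majority vote over all models; `answers` maps model -> answer."""
    return Counter(answers.values()).most_common(1)[0][0]


def hierarchical_family_vote(answers, family_of):
    """Vote within each family first, then vote across the family winners.

    Assumption: unweighted majority at both levels; ties resolve to the
    answer seen first (the behavior of Counter.most_common).
    """
    by_family = defaultdict(list)
    for model, answer in answers.items():
        by_family[family_of[model]].append(answer)
    family_winners = [
        Counter(family_answers).most_common(1)[0][0]
        for family_answers in by_family.values()
    ]
    return Counter(family_winners).most_common(1)[0][0]


# Hypothetical pool: three models from family "A" share the same wrong
# answer, while single models from families "B" and "C" are correct.
family_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}
answers = {"a1": "cat", "a2": "cat", "a3": "cat", "b1": "dog", "c1": "dog"}

flat_result = flat_vote(answers)                            # 3 votes vs 2
hfv_result = hierarchical_family_vote(answers, family_of)   # 1 family vs 2
print(flat_result, hfv_result)
```

In this toy pool the large family outvotes the two correct models under flat voting, whereas collapsing each family to a single vote lets the two independent families prevail. This is the kind of case the Misleading tier describes.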

Paper Structure (57 sections, 1 theorem, 5 equations, 6 figures, 12 tables, 1 algorithm)

Introduction
Related Work
Ensemble theory and diversity.
LLM and VLM ensembles.
Structured and hierarchical aggregation.
VQA benchmarks and evaluation.
Experimental Setup
Models.
Benchmarks.
Aggregation baselines.
Statistical testing.
Analysis: Family Structure in VLM Ensembles
The Ensemble Ceiling
Difficulty Taxonomy
Error Correlation Has Family Structure
...and 42 more sections

Key Result

Proposition 1

Consider a binary question with $F$ families of sizes $n_1, \ldots, n_F$. Let $\rho_w$ be the average within-family error correlation and $\rho_b$ the average between-family correlation, and let $P_f$ be the accuracy of each family's internal ensemble. HFV outperforms flat voting when all of the fol

Figures (6)

Figure 1: Gap decomposition across benchmarks. Calibrated voting captures only a small fraction of the gap between single-best and oracle accuracy, especially on VQAv2 and GQA.
Figure 2: Hierarchical clustering (Ward linkage) on error correlation distance. Family-colored leaves reveal that architecture families cluster together, confirming correlated within-family errors.
Figure 3: Per-family accuracy across benchmarks. HFV helps when family quality is relatively balanced (VQAv2, GQA) but hurts when one family is dramatically weaker (InternVL3 at 49% on TextVQA).
Figure 4: Error PCA of model accuracy vectors on VQAv2. Architecture families (color-coded) cluster together in the error landscape, confirming that family structure is a real property, not an assumption.
Figure 5: LCS vs. calibrated voting accuracy as a function of ensemble size on VQAv2 and GQA. LCS gains grow with pool size while calibrated voting degrades from within-family redundancy.
...and 1 more figures

Theorems & Definitions (1)

Proposition 1: When HFV outperforms flat voting
