Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles

Zacharie Bugaud

Abstract

Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce the effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) in which correlated majority errors drive ensemble accuracy to 0% even though the best model is correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method that weights models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality. It achieves the largest gains (+0.68% on VQAv2, +0.61% on TextVQA, and +2.45% on GQA, all significant) and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.
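
The two-stage idea behind HFV, aggregating within each family and then voting across families, can be illustrated with a minimal sketch. This is not the authors' implementation: the model names, the family map, and the tie-breaking rule (the first answer seen wins) are assumptions made only for illustration, and plain plurality is used at both stages.

```python
from collections import Counter, defaultdict


def plurality(votes):
    """Return the most common answer; ties go to the answer seen first (an assumption)."""
    counts = Counter(votes)
    best = max(counts.values())
    for vote in votes:
        if counts[vote] == best:
            return vote


def flat_vote(answers):
    """Standard voting: every model counts once, regardless of family."""
    return plurality(list(answers.values()))


def hierarchical_family_vote(answers, family_of):
    """Two-stage voting: plurality inside each family, then plurality across families."""
    by_family = defaultdict(list)
    for model, answer in answers.items():
        by_family[family_of[model]].append(answer)
    family_answers = [plurality(votes) for votes in by_family.values()]
    return plurality(family_answers)


# Hypothetical example: one three-model family outvotes two single-model families.
answers = {"a1": "cat", "a2": "cat", "a3": "cat", "b1": "dog", "c1": "dog"}
family_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}

print(flat_vote(answers))                            # cat
print(hierarchical_family_vote(answers, family_of))  # dog
```

In this toy case the three models of family A share one answer and win the flat vote 3 to 2, whereas the family-level vote counts A once and the answer backed by two independent families wins.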

Paper Structure (57 sections, 1 theorem, 5 equations, 6 figures, 12 tables, 1 algorithm)

Key Result

Proposition 1

Consider a binary question with $F$ families of sizes $n_1, \ldots, n_F$. Let $\rho_w$ be the average within-family error correlation and $\rho_b$ the average between-family correlation, and let $P_f$ be the accuracy of each family's internal ensemble. HFV outperforms flat voting when all of the fol

Figures (6)

  • Figure 1: Gap decomposition across benchmarks. Calibrated voting captures only a small fraction of the gap between single-best and oracle accuracy, especially on VQAv2 and GQA.
  • Figure 2: Hierarchical clustering (Ward linkage) on error correlation distance. Family-colored leaves reveal that architecture families cluster together, confirming correlated within-family errors.
  • Figure 3: Per-family accuracy across benchmarks. HFV helps when family quality is relatively balanced (VQAv2, GQA) but hurts when one family is dramatically weaker (InternVL3 at 49% on TextVQA).
  • Figure 4: Error PCA of model accuracy vectors on VQAv2. Architecture families (color-coded) cluster together in the error landscape, confirming family structure is a real property---not an assumption.
  • Figure 5: LCS vs. calibrated voting accuracy as a function of ensemble size on VQAv2 and GQA. LCS gains grow with pool size while calibrated voting degrades from within-family redundancy.
  • ...and 1 more figure
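
The error-correlation analysis behind Figures 2 and 4 can be sketched as follows. The excerpt does not define the distance, so this sketch assumes `1 - Pearson correlation` between per-question binary error vectors (1 = wrong, 0 = correct). The model names and error vectors are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def error_distances(errors):
    """Pairwise distance 1 - correlation between models' error vectors (assumed definition)."""
    names = sorted(errors)
    return {
        (a, b): 1.0 - pearson(errors[a], errors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical error vectors over 8 questions: a1 and a2 tend to fail together.
errors = {
    "a1": [1, 1, 0, 0, 1, 0, 0, 0],
    "a2": [1, 1, 0, 0, 0, 0, 0, 0],
    "b1": [0, 0, 0, 1, 0, 0, 1, 0],
}

distances = error_distances(errors)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

Models that fail on the same questions end up closer together than models that fail on different ones. A matrix of such distances could then be passed to a hierarchical clustering routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"`.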

Theorems & Definitions (1)

  • Proposition 1: When HFV outperforms flat voting