DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance

Seffi Cohen, Niv Goldshlager, Nurit Cohen-Inger, Bracha Shapira, Lior Rokach

TL;DR

DFPE introduces a training-free, subject-adaptive ensemble that preserves model diversity via fingerprint clustering, filters underperformers with a per-subject quantile, and applies exponential weighting for robust aggregation. On the MMLU benchmark, it achieves about a 3% gain in overall accuracy and a 5% boost in discipline-level accuracy over the best single model. The approach carefully balances diversity, competence, and adaptability, performing well across a wide range of disciplines while maintaining practical efficiency. This methodology offers a scalable path to improve multitask language understanding without fine-tuning, with potential extensions to larger pools and open-ended tasks.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities across various natural language processing tasks but often struggle to excel uniformly in diverse or complex domains. We propose a novel ensemble method, Diverse Fingerprint Ensemble (DFPE), which leverages the complementary strengths of multiple LLMs to achieve more robust performance. Our approach involves: (1) clustering models based on their response "fingerprint" patterns, (2) applying a quantile-based filtering mechanism to remove underperforming models at a per-subject level, and (3) assigning adaptive weights to the remaining models based on their subject-wise validation accuracy. In experiments on the Massive Multitask Language Understanding (MMLU) benchmark, DFPE outperforms the best single model by 3% in overall accuracy and 5% in discipline-level accuracy. This method increases the robustness and generalization of LLMs and underscores how model selection, diversity preservation, and performance-driven weighting can effectively address challenging, multi-faceted language understanding tasks.
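
Steps (2) and (3) of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the linear-interpolated quantile, the default `q`, and the `exp(scale * accuracy)` weighting form are all assumptions made for illustration, since the exact formulas are not given in this summary.

```python
import math
from collections import defaultdict

def quantile_threshold(accuracies, q):
    """Linear-interpolated q-quantile of a list of accuracies (0 <= q <= 1)."""
    xs = sorted(accuracies)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def filter_and_weight(val_acc, q=0.5, scale=5.0):
    """Drop models below the per-subject q-quantile, then weight the survivors.

    val_acc maps model name -> validation accuracy on ONE subject.
    The weight exp(scale * accuracy) is an assumed form of "exponential
    weighting"; `scale` is a hypothetical hyperparameter.
    """
    cutoff = quantile_threshold(list(val_acc.values()), q)
    kept = {m: a for m, a in val_acc.items() if a >= cutoff}
    raw = {m: math.exp(scale * a) for m, a in kept.items()}
    total = sum(raw.values())
    return {m: w / total for m, w in raw.items()}  # normalized to sum to 1

def weighted_vote(weights, answers):
    """Return the answer option with the largest total weight."""
    scores = defaultdict(float)
    for model, w in weights.items():
        scores[answers[model]] += w
    return max(scores, key=scores.get)

# Toy example for a single subject (made-up accuracies and answers).
val_acc = {"model_a": 0.9, "model_b": 0.6, "model_c": 0.5, "model_d": 0.4}
weights = filter_and_weight(val_acc, q=0.5)
answers = {"model_a": "B", "model_b": "C", "model_c": "C", "model_d": "C"}
prediction = weighted_vote(weights, answers)
```

In this toy run the two weakest models are filtered out before voting, so the answer of the most accurate survivor carries the vote even though three of the four original models chose a different option.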

Paper Structure

This paper contains 36 sections, 5 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: A pool of LLMs $\mathcal{M}$ is evaluated per subject $S_k \in \mathcal{S}$ using a small validation set $\mathcal{Q}_k$. Each model’s predictions and accuracy $\alpha_{i,k}$ are used to generate “fingerprints,” which are subsequently clustered to maintain diversity. Models failing to meet a subject-specific quantile threshold are removed, and the most accurate model is chosen from each cluster to form the representative set $\mathcal{M}_k^*$. An exponential weighting scheme is then applied to these representatives before their final, weighted votes are aggregated to produce the answer.
  • Figure 2: Average Accuracy by Discipline. DFPE consistently outperforms the compared methods across a wide range of disciplines, highlighting broad-spectrum gains.
  • Figure 3: Sensitivity analysis. Left: Accuracy vs. Quantile Threshold; Middle: Accuracy vs. AccuracyFactor; Right: Accuracy vs. Epsilon (log-scale axis). Performance remains robust within moderate parameter ranges, easing the tuning process.
  • Figure 4: Distribution of selected models per subject. The variation in bar heights demonstrates how DFPE adapts its ensemble size to subject-specific requirements while maintaining efficiency.
  • Figure 5: Heatmap of model co-occurrences within clusters. Cell values indicate frequency of model pairs being selected together. The diagonal is zero by definition. Higher values (darker colors) suggest stronger complementarity between models.
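
The fingerprint clustering and representative selection described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: here a fingerprint is simply a model's tuple of answers on the validation set, distance is the fraction of disagreeing answers, and clustering is a greedy single pass with a hypothetical `max_dist` threshold. The clustering method and distance actually used by DFPE are not specified in this summary.

```python
def disagreement(fp_a, fp_b):
    """Fraction of validation questions on which two models answer differently."""
    return sum(x != y for x, y in zip(fp_a, fp_b)) / len(fp_a)

def cluster_fingerprints(fingerprints, max_dist=0.25):
    """Greedy clustering: join the first cluster whose seed is close enough.

    fingerprints maps model name -> tuple of predicted options.
    The first member of each cluster acts as its seed (an assumption).
    """
    clusters = []
    for name, fp in fingerprints.items():
        for cluster in clusters:
            if disagreement(fp, fingerprints[cluster[0]]) <= max_dist:
                cluster.append(name)
                break
        else:  # no existing cluster was close enough
            clusters.append([name])
    return clusters

def pick_representatives(clusters, fingerprints, gold):
    """Keep the most accurate model from each cluster."""
    def accuracy(name):
        return sum(p == g for p, g in zip(fingerprints[name], gold)) / len(gold)
    return [max(cluster, key=accuracy) for cluster in clusters]

# Toy validation set with four questions (made-up data).
gold = ("A", "B", "C", "D")
fingerprints = {
    "m1": ("A", "B", "C", "D"),
    "m2": ("A", "B", "C", "A"),
    "m3": ("D", "A", "C", "D"),
    "m4": ("D", "A", "B", "D"),
}
clusters = cluster_fingerprints(fingerprints)
representatives = pick_representatives(clusters, fingerprints, gold)
```

Models that answer almost identically land in the same cluster and contribute a single representative, which is how the ensemble avoids counting near-duplicate models several times in the vote.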