Majority Rules: LLM Ensemble is a Winning Approach for Content Categorization

Ariel Kamen, Yakov Kamen

TL;DR

The paper addresses the challenge of taxonomy-based content categorization with zero-shot LLMs, which suffer from instability, hallucination, and category inflation. It proposes an ensemble framework (eLLM) that treats multiple LLMs as independent experts and aggregates their outputs using a Collective Decision-Making (CDM) criterion tailored to sparse hierarchical taxonomies, exemplified on the IAB taxonomy. Empirical results across ten LLMs and ensemble sizes up to 10 demonstrate substantial F1-score gains (up to ~67% over the best single model) and reduced hallucinations, approaching human-expert levels in some settings while highlighting the cost trade-offs of multi-model inference. The work provides a formal mathematical foundation, a robust evaluation protocol, and practical guidance for deploying ensemble-based taxonomy classification in scalable labeling pipelines, with future directions toward dynamic ensembles and weighting strategies to balance accuracy and efficiency.

Abstract

This study introduces an ensemble framework for unstructured text categorization using large language models (LLMs). By integrating multiple models, the ensemble large language model (eLLM) framework addresses common weaknesses of individual systems, including inconsistency, hallucination, category inflation, and misclassification. The eLLM approach yields a substantial performance improvement of up to 65% in F1-score over the strongest single model. We formalize the ensemble process through a mathematical model of collective decision-making and establish principled aggregation criteria. Using the Interactive Advertising Bureau (IAB) hierarchical taxonomy, we evaluate ten state-of-the-art LLMs under identical zero-shot conditions on a human-annotated corpus of 8,660 samples. Results show that individual models plateau in performance due to the compression of semantically rich text into sparse categorical representations, while eLLM improves both robustness and accuracy. With a diverse consortium of models, eLLM achieves near human-expert-level performance, offering a scalable and reliable solution for taxonomy-based classification that may significantly reduce dependence on human expert labeling.
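To make the idea of aggregating several models' outputs concrete, here is a minimal sketch of plain majority voting over per-model category sets. It is an illustration only: the paper's actual collective decision-making criterion is not specified in this summary, so the function name `majority_vote`, the `threshold` parameter, and the example category labels are all assumptions.

```python
from collections import Counter


def majority_vote(predictions, threshold=0.5):
    """Keep categories proposed by more than `threshold` of the models.

    `predictions` is a list with one collection of category labels per model.
    This is a generic majority rule, assumed for illustration; it is not
    the paper's CDM criterion.
    """
    if not predictions:
        return set()
    # Count each category at most once per model.
    counts = Counter(cat for labels in predictions for cat in set(labels))
    min_votes = threshold * len(predictions)
    return {cat for cat, votes in counts.items() if votes > min_votes}


# Hypothetical outputs from three models for one text sample.
predictions = [
    {"Sports", "Soccer"},
    {"Sports", "Soccer", "Travel"},
    {"Sports", "News"},
]
result = majority_vote(predictions)
print(sorted(result))
```

In this sketch, labels proposed by only one model ("Travel", "News") are dropped, which is one simple way an ensemble can suppress category inflation and isolated hallucinated labels.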

Paper Structure

This paper contains 36 sections, 30 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Overview of the IAB Taxonomy