A Case Study of Scalable Content Annotation Using Multi-LLM Consensus and Human Review

Mingyue Yuan, Jieshan Chen, Zhenchang Xing, Gelareh Mohammadi, Aaron Quigley

TL;DR

The paper addresses scalable content annotation by integrating multiple LLMs with targeted human review to balance automation and accuracy. The proposed MCHR framework combines independent model analyses, a structured consensus mechanism, and human-in-the-loop review, evaluated on the COMMITPACKFT dataset spanning 277 languages. Results show strong accuracy across four classification levels and substantial reductions in annotation time, with open-set cases benefiting most from human refinement. The study provides practical guidance for deploying human-AI collaboration in high-volume annotation tasks and highlights taxonomy challenges in open-set regimes.

Abstract

Content annotation at scale remains challenging, requiring substantial human expertise and effort. This paper presents a case study in code documentation analysis, where we explore the balance between automation efficiency and annotation accuracy. We present MCHR (Multi-LLM Consensus with Human Review), a novel semi-automated framework that enhances annotation scalability through the systematic integration of multiple LLMs and targeted human review. Our framework introduces a structured consensus-building mechanism among LLMs and an adaptive review protocol that strategically engages human expertise. Through our case study, we demonstrate that MCHR reduces annotation time by 32% to 100% compared to manual annotation while maintaining high accuracy (85.5% to 98%) across different difficulty levels, from basic binary classification to challenging open-set scenarios.

Paper Structure

This paper contains 7 sections, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Overview of our semi-automated annotation framework: (A) Independent Model Analysis - multiple LLMs process input data independently; (B) Consensus Building - models collaborate to reach agreement; (C) Human-in-the-Loop Review - expert review for cases requiring human judgment.
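
The three stages in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify how consensus is reached, so simple majority voting stands in for stage (B), and the names `Labeler`, `Annotation`, `annotate`, and `min_agreement` are hypothetical. Each model is represented as a plain function from text to label so the example runs without any LLM API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical stand-in for an LLM call: takes the input text, returns a label.
Labeler = Callable[[str], str]


@dataclass
class Annotation:
    label: Optional[str]      # None when the item is escalated to a human
    agreement: float          # share of models that chose the most common label
    needs_human_review: bool


def annotate(
    text: str,
    labelers: Sequence[Labeler],
    min_agreement: float = 1.0,
) -> Annotation:
    """Label one item with several models and escalate on disagreement.

    Assumption: consensus is approximated by majority voting with an
    agreement threshold; the paper's actual mechanism may differ.
    """
    if not labelers:
        raise ValueError("at least one labeler is required")

    # (A) Independent model analysis: each model labels the item on its own.
    votes = [labeler(text) for labeler in labelers]

    # (B) Consensus building, approximated here by counting votes.
    top_label, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)

    # (C) Human-in-the-loop review: escalate when agreement is too low.
    if agreement >= min_agreement:
        return Annotation(top_label, agreement, needs_human_review=False)
    return Annotation(None, agreement, needs_human_review=True)


if __name__ == "__main__":
    # Toy labelers for illustration only; real ones would wrap model calls.
    always_doc = lambda text: "documentation"
    keyword = lambda text: "code" if "def " in text else "documentation"

    print(annotate("Update the README", [always_doc, always_doc, keyword]))
    print(annotate("def parse(): ...", [always_doc, always_doc, keyword]))
```

With the default `min_agreement=1.0`, only unanimous items are accepted automatically and everything else goes to a reviewer. Lowering the threshold trades review effort for a higher risk of accepting a wrong majority label.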