A Comparative Study on Annotation Quality of Crowdsourcing and LLM via Label Aggregation

Jiyi Li

TL;DR

To examine whether LLMs can outperform crowdsourcing for data annotation, the paper builds a benchmark from existing crowdsourcing datasets that include both the text content and the individual crowd labels. It compares crowd labels with LLM labels (from ChatGPT and Vicuna) at the single-label level and at the aggregated-label level, and introduces a Crowd-LLM hybrid label aggregation approach evaluated with three aggregation methods: majority voting (MV), Dawid-Skene (DS), and GLAD. The findings show that adding labels from a high-quality LLM to crowd data generally improves the quality of the final aggregated labels, often beyond the quality of the LLM labels alone, whereas labels from a less capable LLM such as Vicuna offer limited gains. The work gives practical guidance for annotation pipelines and highlights the continued value of crowd workers when quality control is applied. It also notes that the study is limited to categorical labels and outlines directions for future work.

Abstract

Whether Large Language Models (LLMs) can outperform crowdsourcing on data annotation tasks has recently attracted interest. Some works have examined this question by collecting new datasets and comparing the average performance of individual crowd workers and LLM workers on specific NLP tasks. However, on the one hand, existing datasets built for studying annotation quality in crowdsourcing have not yet been used in such evaluations, although they could provide reliable evaluations from a different viewpoint. On the other hand, the quality of aggregated labels is crucial because, in crowdsourcing, the labels that are eventually collected are the estimated labels aggregated from multiple crowd labels on the same instances. Therefore, in this paper, we first investigate which existing crowdsourcing datasets can be used for a comparative study and create a benchmark. We then compare the quality of individual crowd labels and LLM labels and evaluate the aggregated labels. In addition, we propose a Crowd-LLM hybrid label aggregation method and verify its performance. We find that adding labels from good LLMs to existing crowdsourcing datasets can enhance the quality of the aggregated labels, which is also higher than the quality of the LLM labels themselves.
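
The hybrid aggregation idea can be illustrated with a minimal sketch. It assumes that the LLM is treated as one additional worker whose label is appended to each instance's crowd labels before aggregation, and it uses plain majority voting only; the abstract does not spell out the procedure, so this is an illustration rather than the paper's implementation. The function names, the toy data, and the tie-breaking rule (pick the alphabetically smallest label) are all hypothetical.

```python
from collections import Counter, defaultdict


def merge_labels(crowd, llm):
    """Group labels by instance, appending the LLM label as one extra vote.

    crowd: iterable of (instance_id, worker_id, label) triples.
    llm:   dict mapping instance_id -> label (may be empty).
    """
    merged = defaultdict(list)
    for instance_id, _worker_id, label in crowd:
        merged[instance_id].append(label)
    for instance_id, label in llm.items():
        merged[instance_id].append(label)
    return dict(merged)


def majority_vote(labels_by_instance):
    """Return the most frequent label per instance.

    Ties are broken by taking the alphabetically smallest label,
    which is an arbitrary choice made for this sketch.
    """
    aggregated = {}
    for instance_id, labels in labels_by_instance.items():
        counts = Counter(labels)
        top = max(counts.values())
        aggregated[instance_id] = min(l for l, c in counts.items() if c == top)
    return aggregated


# Toy data (hypothetical): two instances, three crowd workers, one LLM.
crowd = [
    ("t1", "w1", "pos"), ("t1", "w2", "neg"),
    ("t2", "w1", "neg"), ("t2", "w2", "pos"), ("t2", "w3", "neg"),
]
llm = {"t1": "pos", "t2": "neg"}

crowd_only = majority_vote(merge_labels(crowd, {}))
hybrid = majority_vote(merge_labels(crowd, llm))
print(crowd_only)
print(hybrid)
```

In this toy example the crowd labels for `t1` are tied, so the crowd-only result depends on the tie-breaking rule, while the extra LLM vote settles the tie in the hybrid result. Model-based aggregators such as DS or GLAD would instead estimate each worker's reliability, including that of the LLM worker, rather than weighting all votes equally.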

Paper Structure (7 sections, 4 tables)