ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

Lequan Lin, Dai Shi, Andi Han, Feng Chen, Qiuzheng Chen, Jiawen Li, Zhaoyang Li, Jiyuan Li, Zhenbang Sun, Junbin Gao

TL;DR

The paper presents ACT, a data annotation pipeline where a multimodal LLM annotator is complemented by a separate criticizer to estimate per-sample error probabilities, enabling budgeted human review of the most suspicious cases. It formalizes the framework, defines budget-aware sampling rules, and introduces AQG and ABS to quantify annotation gain and budget efficiency, supported by a theoretical analysis of an ACT loss that remains unbiased with controlled variance. Empirically, ACT achieves downstream performance within approximately 2% of fully human-annotated models while reducing human costs by up to 90% across NLP, CV, and multimodal tasks, with exponential weighting and thresholding often outperforming normalization. The work also offers practical guidelines for annotator-criticizer selection, demonstrates the benefits and limitations of white-box versus black-box criticizers, and discusses extending ACT to more complex tasks and ethical considerations.

Abstract

Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Annotation with Critical Thinking (ACT) data pipeline, where LLMs serve not only as annotators but also as judges to critically identify potential errors. Human effort is then directed towards reviewing only the most "suspicious" cases, significantly improving human annotation efficiency. Our major contributions are as follows: (1) ACT is applicable to a wide range of domains, including natural language processing (NLP), computer vision (CV), and multimodal understanding, by leveraging multimodal LLMs (MLLMs). (2) Through empirical studies, we derive 7 insights on how to enhance annotation quality while reducing human cost, and then translate these findings into user-friendly guidelines. (3) We theoretically analyze how to modify the loss function so that models trained on ACT data achieve similar performance to those trained on fully human-annotated data. Our experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human costs.
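The routing idea in the abstract (machine labels for everything, human review only for the most "suspicious" cases) can be illustrated with a minimal sketch. It assumes the criticizer has already produced one estimated error probability per sample, and it uses a plain deterministic top-k rule under a fixed review budget. That rule is a simplification chosen for illustration; it is not the paper's budget-aware sampling rules, which are not reproduced here. All function and variable names are hypothetical.

```python
from typing import Dict, List, Sequence


def route_for_human_review(error_probs: Sequence[float], budget: int) -> List[int]:
    """Pick the `budget` samples with the highest estimated error probability.

    Assumption: plain top-k selection, used here only as an illustration.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank sample indices from most to least suspicious.
    ranked = sorted(range(len(error_probs)),
                    key=lambda i: error_probs[i],
                    reverse=True)
    # Return the selected indices in their original order.
    return sorted(ranked[:budget])


def merge_labels(machine_labels: Sequence[str],
                 human_labels: Dict[int, str]) -> List[str]:
    """Overwrite machine labels with human labels wherever a review exists."""
    merged = list(machine_labels)
    for index, label in human_labels.items():
        merged[index] = label
    return merged


if __name__ == "__main__":
    # Toy data: labels from a hypothetical annotator and scores from a
    # hypothetical criticizer.
    machine_labels = ["cat", "dog", "cat", "bird", "dog"]
    error_probs = [0.05, 0.80, 0.10, 0.65, 0.20]

    to_review = route_for_human_review(error_probs, budget=2)
    # Stand-in for the human reviewers' answers on the selected samples.
    human_labels = {1: "cat", 3: "bird"}

    print(to_review)
    print(merge_labels(machine_labels, human_labels))
```

With a budget of 2 out of 5 samples, humans see only 40% of the data, and the final label set mixes reviewed and machine-only labels.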

Paper Structure

This paper contains 44 sections, 22 equations, 13 figures, and 11 tables.

Figures (13)

  • Figure 1: Illustration of the ACT data pipeline (left) and the user guidelines (right).
  • Figure 2: Accuracy of various models as machine annotators with different prompt strategies.
  • Figure 3: Results of black-box strategies. The metric shown is ABS (%), where higher values indicate better annotation efficiency of the ACT data pipeline. The best results are highlighted with black frames.
  • Figure 4: Comparison of ABS (%) for the same model using black-box and white-box strategies. Green bars correspond to black-box strategies (B), while blue bars represent white-box strategies (W). The horizontal dashed lines indicate the best black-box results for reference.
  • Figure 5: Relationship between annotation and criticism abilities. The base annotators for criticism ability evaluation are shown in the titles (all with top-1 annotation ability from Figure 2). Each point represents a model. The models with the highest ABS are highlighted with stars. The slopes of the fitted regression lines are indicated in the legends.
  • ...and 8 more figures

Theorems & Definitions (2)

  • proof
  • proof