Efficient Human-in-the-Loop Active Learning: A Novel Framework for Data Labeling in AI Systems

Yiran Huang, Jian-Feng Yang, Haoda Fu

TL;DR

This work tackles data-labeling efficiency in AI by proposing a human-in-the-loop active learning framework that supports multiple query types and integrates full and partial information through a probabilistic model $p(\cdot;\theta)$. It introduces an entropy-based active-learning criterion with cost-aware, multi-question querying and a data-driven exploration-exploitation frame that employs a model-guided distance $d(x,x';\theta)$ to adaptively balance exploration and exploitation. Theoretical results bound the likelihood of unexpected answers under accurate models, and empirical studies across five real datasets (including logistic, neural, and transfer-learning scenarios) show faster learning, higher final accuracy, and lower cross-entropy loss than traditional AL baselines, with additional gains from the exploration-exploitation component. The framework is versatile across probabilistic models and practical for domains requiring efficient labeling, such as medical imaging and large-scale image tasks, while leaving room for extensions to non-Gaussian partial-information modeling and evolving class sets.
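
To make the idea of an entropy-based, cost-aware query criterion concrete, here is a minimal sketch of standard entropy-based uncertainty sampling with a simple cost normalization. It is not the paper's criterion: the function names `predictive_entropy` and `select_query`, the toy probabilities, and the choice to divide entropy by cost are all illustrative assumptions.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_query(pool_probs, costs):
    """Index of the unlabeled point with the largest entropy per unit cost.

    pool_probs: one predicted class distribution per unlabeled point.
    costs: positive labeling cost per point (hypothetical; assumed known).
    """
    scores = [predictive_entropy(p) / c for p, c in zip(pool_probs, costs)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy pool: three unlabeled points, three classes.
pool_probs = [
    [0.98, 0.01, 0.01],  # model is confident
    [0.40, 0.35, 0.25],  # model is very uncertain
    [0.50, 0.50, 0.00],  # uncertain between two classes
]

equal_cost_choice = select_query(pool_probs, [1.0, 1.0, 1.0])
costly_choice = select_query(pool_probs, [1.0, 4.0, 1.0])
```

With equal costs the most uncertain point (index 1) is chosen. Once that point becomes four times as expensive to label, the sketch prefers index 2, which is the kind of trade-off a cost-aware criterion is meant to capture.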

Abstract

Modern AI algorithms require labeled data. In the real world, the majority of data are unlabeled, and labeling them is costly. This is particularly true in areas that require special skills, such as the reading of radiology images by physicians. To use experts' time for data labeling as efficiently as possible, one promising approach is human-in-the-loop active learning. In this work, we propose a novel active learning framework with significant potential for application in modern AI systems. Unlike traditional active learning methods, which focus only on determining which data point should be labeled, our framework also introduces an innovative perspective on incorporating different query schemes. We propose a model that integrates the information from different types of queries. Based on this model, our active learning framework can automatically determine how the next question is queried. We further incorporate a data-driven exploration and exploitation framework into our active learning method. This framework can be embedded in numerous active learning algorithms. In simulations on five real-world datasets, including a highly complex real image task, our proposed active learning framework exhibits higher accuracy and lower loss than other methods.

Paper Structure (12 sections, 6 theorems, 23 equations, 7 figures, 3 tables, 2 algorithms)


Key Result

Proposition 1

For any model $h\in\mathcal{H}_0$, suppose the true label for $x$ is $y$. Then the entropy is within the range

Figures (7)

  • Figure 1: Illustration of exploration and exploitation frame.
  • Figure 2: Illustrations of the entropy-range proposition (Proposition 1), the entropy-validity corollary, and the exploration-exploitation validity theorem.
  • Figure 3: Accuracy and the sum of cross-entropy of active learning for the first dataset. Pro, En and Ra indicate the proposed active learning method and the two traditional active learning methods mentioned at the beginning of the simulation section, respectively. If "dis" is added, the exploration and exploitation frame is taken into consideration. IdealAL is the "ideal" active learning process and "Optimal" is the optimal model trained on an extensive set of data.
  • Figure 4: Accuracy and sum of cross-entropy of active learning for the second dataset.
  • Figure 5: Accuracy and sum of cross-entropy of active learning for the third dataset.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Proposition 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Theorem 3