Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

Ekaterina Artemova, Akim Tsvigun, Dominik Schlechtweg, Natalia Fedorova, Konstantin Chernyshev, Sergei Tilga, Boris Obmoroshev

TL;DR

This paper presents a hands-on tutorial on labeling with large language models and human-in-the-loop approaches, addressing the demand for scalable, high-quality labeled data. It surveys and operationalizes strategies such as synthetic data generation, active learning, and hybrid labeling, supplemented by real-world case studies and best practices for annotator management and dataset quality. The tutorial emphasizes practical workflows, evaluation, and ethical considerations, including a 30-minute hands-on hybrid annotation session to demonstrate how human and model labeling can be integrated for cost-effective, accurate data. By combining theory with a concrete workshop and diverse expertise, the work offers actionable guidance for NLP practitioners to optimize data labeling pipelines in both research and industry contexts.

Abstract

Training and deploying machine learning models rely on large amounts of human-annotated data. As human labeling becomes increasingly expensive and time-consuming, recent research has developed multiple strategies to speed up annotation and reduce costs and human workload: synthetic training data generation, active learning, and hybrid labeling. This tutorial is oriented toward practical applications: we will present the basics of each strategy, highlight its benefits and limitations, and discuss real-life case studies in detail. Additionally, we will walk through best practices for managing human annotators and controlling the quality of the final dataset. The tutorial includes a hands-on workshop in which attendees will be guided through implementing a hybrid annotation setup. This tutorial is designed for NLP practitioners from both research and industry backgrounds who are involved in or interested in optimizing data labeling projects.

Paper Structure

This paper contains 17 sections.