Understanding and Tackling Label Errors in Individual-Level Nature Language Understanding

Yunpeng Xiao, Youpeng Zhao, Kai Shu

TL;DR

The paper addresses systematic label errors in individual-level NLU, where a single post may not reveal its publisher's stance or sentiment. It proposes a data-expansion methodology that aggregates multiple posts from the same user within a defined window, and it provides manual annotation guidelines plus LLM-based judgments to re-annotate stance and topic-based sentiment. Re-annotation reveals error rates as high as 31.7% and 23.3% in the sampled original labels, and large language models (e.g., GPT-4o, Llama3-70B) reach accuracies above 87% on the re-annotated datasets. The work highlights the importance of social-contextual factors in dataset construction and shows that LLMs can handle individual-level NLU tasks effectively when richer user context is available, offering a practical path to cleaner annotations and more reliable evaluations.

Abstract

Natural language understanding (NLU) is a task that enables machines to understand human language. Some tasks, such as stance detection and sentiment analysis, are closely tied to individual subjective perspectives and are therefore termed individual-level NLU. Previously, these tasks have often been simplified to text-level NLU tasks that ignore individual factors. This not only makes inference difficult and unexplainable but also often results in a large number of label errors when creating datasets. To address these limitations, we propose a new NLU annotation guideline based on individual-level factors. Specifically, we incorporate other posts by the same individual and annotate that individual's subjective perspective after considering all of their posts. We use this guideline to expand and re-annotate stance detection and topic-based sentiment analysis datasets. We find that error rates in the samples were as high as 31.7% and 23.3%. We further use large language models to conduct experiments on the re-annotated datasets and find that they perform well on both datasets once individual factors are added. Both GPT-4o and Llama3-70B achieve an accuracy greater than 87% on the re-annotated datasets. We also verify the effectiveness of individual factors through ablation studies. We call on future researchers to include individual factors when creating such datasets. Our re-annotated dataset can be found at https://github.com/24yearsoldstudent/Individual-NLU
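
The workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Post` type, the function names, the `max_posts` cap, the prompt wording, and the Favor/Against/Neutral label set are all hypothetical choices made for illustration. The sketch groups posts by author, assembles an LLM input from a data part and a prompt part, and computes a label error rate under the assumption that an error is any sample whose original label differs from its re-annotated label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one social media post."""
    user_id: str
    text: str


def group_posts_by_user(posts):
    """Collect post texts under each author's ID, preserving input order."""
    grouped = defaultdict(list)
    for post in posts:
        grouped[post.user_id].append(post.text)
    return dict(grouped)


def build_judge_input(target, user_posts, max_posts=5):
    """Join a data part (the user's posts) with a prompt part (the question).

    The prompt wording and label set are assumptions, not taken from the paper.
    """
    data = "\n".join(
        f"Post {i}: {text}"
        for i, text in enumerate(user_posts[:max_posts], start=1)
    )
    prompt = (
        f"Considering all posts above, what is this user's stance toward "
        f"'{target}'? Answer with Favor, Against, or Neutral."
    )
    return f"{data}\n\n{prompt}"


def label_error_rate(original_labels, reannotated_labels):
    """Fraction of samples whose original label differs from the re-annotation."""
    if not original_labels or len(original_labels) != len(reannotated_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    mismatches = sum(o != r for o, r in zip(original_labels, reannotated_labels))
    return mismatches / len(original_labels)
```

Capping the number of posts per user keeps the LLM input bounded; the cap of 5 used here is arbitrary and would need tuning for a real dataset.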

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: A typical example of potential label error in stance detection.
  • Figure 2: The process of manual re-annotation and LLM judging. In manual re-annotation, after other posts by the same individuals (network users) related to the topic/target have been collected, three annotators follow the guidelines to assign individual-level labels. For the LLM judges, the input is divided into two parts: data and prompts.
  • Figure 3: Some examples of correcting label errors. Using multiple posts from the same publisher can more accurately determine the user's sentiment or stance, and can effectively explain why this label is given.
  • Figure 4: Performance on the SemEval stance detection dataset using different numbers of tweets as LLM input.
  • Figure 5: Multi-post example. In this case, although the user's stance is "Against" in both the new and original datasets, it is difficult or even impossible to infer that stance from the text in the original dataset alone. After other tweets from the user are added, the LLM gives an accurate prediction.
  • ...and 1 more figure