Understanding and Tackling Label Errors in Individual-Level Nature Language Understanding

Yunpeng Xiao; Youpeng Zhao; Kai Shu

TL;DR

The paper addresses systematic label errors in individual-level NLU, where a single post may not reveal its publisher's stance or sentiment. It proposes a data-expansion methodology that aggregates multiple posts from the same user within a defined window, and it provides manual annotation guidelines plus LLM-based judgments to re-annotate stance and topic-based sentiment. The re-annotated datasets show substantial reductions in label noise, and large language models (e.g., GPT-4o, Llama3-70B) reach accuracies exceeding 87% on them. The work highlights the importance of social-contextual factors in dataset construction and demonstrates that LLMs can handle individual-level NLU tasks effectively when richer user context is available, offering a practical path to cleaner annotations and more reliable evaluations.

Abstract

Natural language understanding (NLU) is a task that enables machines to understand human language. Some tasks, such as stance detection and sentiment analysis, are closely related to individual subjective perspectives and are thus termed individual-level NLU. Previously, these tasks have often been simplified to text-level NLU tasks, ignoring individual factors. This not only makes inference difficult and unexplainable but also often results in a large number of label errors when creating datasets. To address these limitations, we propose a new NLU annotation guideline based on individual-level factors. Specifically, we incorporate other posts by the same individual and then annotate the individual's subjective perspective after considering all of that individual's posts. We use this guideline to expand and re-annotate the stance detection and topic-based sentiment analysis datasets. We find that error rates in the samples were as high as 31.7% and 23.3%. We further use large language models to conduct experiments on the re-annotated datasets and find that they perform well on both datasets after individual factors are added. Both GPT-4o and Llama3-70B achieve an accuracy greater than 87% on the re-annotated datasets. We also verify the effectiveness of individual factors through ablation studies. We call on future researchers to include individual factors when creating such datasets. Our re-annotated dataset can be found at https://github.com/24yearsoldstudent/Individual-NLU
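
The expansion step described above, gathering other posts by the same individual before annotating, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Post` record, the function names, the `[context]`/`[target]` markers, and the 30-day default window are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Post:
    # Hypothetical record layout; the released dataset may use other fields.
    user_id: str
    text: str
    created_at: datetime


def expand_with_user_context(target, all_posts, window=timedelta(days=30)):
    """Collect other posts by the same user that fall within `window`
    of the target post, ordered by time. The window size is an assumption."""
    context = [
        p for p in all_posts
        if p is not target
        and p.user_id == target.user_id
        and abs(p.created_at - target.created_at) <= window
    ]
    return sorted(context, key=lambda p: p.created_at)


def build_annotation_input(target, context):
    """Join the context posts and the target post into one text block that
    a human annotator or an LLM could read before assigning a label."""
    lines = [f"[context] {p.text}" for p in context]
    lines.append(f"[target] {target.text}")
    return "\n".join(lines)
```

An annotator (or an LLM prompt) would then see the target post alongside the same user's nearby posts, rather than the target post alone.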
