Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy

Min Zeng, Caiquan Liu, Shiqi Zhang, Li Xie, Chen Sang, Xiaoxin Chen

TL;DR

This work introduces Data Quality Enhancement (DQE), a method for improving text classification with large language models (LLMs). DQE splits the training data into sampled and unsampled subsets via greedy K-Center sampling, fine-tunes the LLM on the sampled subset, and uses cosine similarity to sort the model's incorrect predictions into uncovered, difficult, and noisy categories. The method then extends training with the uncovered and difficult samples and removes the noisy ones, so it trains on roughly half the data yet reaches higher accuracy than training on the full data and other baselines. Across six public datasets and two LLMs, DQE achieves state-of-the-art results, cuts training time by about 50%, and handles noisy labels and instruction-following robustly. The approach combines semantic vectorization, LLM fine-tuning, and LLM-assisted data cleaning into a practical path to accurate, efficient LLM-based text classifiers.

Abstract

In recent years, the use of large language models (LLMs) for text classification has attracted widespread attention. Despite this, the classification accuracy of LLMs has not yet universally surpassed that of smaller models. LLMs can enhance their performance in text classification through fine-tuning. However, existing data quality research based on LLMs is challenging to apply directly to solve text classification problems. To further improve the performance of LLMs in classification tasks, this paper proposes a data quality enhancement (DQE) method for text classification based on LLMs. This method starts by using a greedy algorithm to select data, dividing the dataset into sampled and unsampled subsets, and then performing fine-tuning of the LLMs using the sampled data. Subsequently, this model is used to predict the outcomes for the unsampled data, categorizing incorrectly predicted data into uncovered, difficult, and noisy data. Experimental results demonstrate that our method effectively enhances the performance of LLMs in text classification tasks and significantly improves training efficiency, saving nearly half of the training time. Our method has achieved state-of-the-art performance in several open-source classification tasks.
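
The greedy selection step described above can be illustrated with a small sketch of greedy K-Center (farthest-first) sampling. This is not the authors' implementation: the function name `greedy_k_center`, the use of cosine distance for selection, the fixed starting index, and the toy embeddings are all assumptions made for illustration. In practice the vectors would come from whatever sentence-embedding model is used for semantic vectorization.

```python
import numpy as np


def greedy_k_center(embeddings, k, seed_index=0):
    """Greedy K-Center (farthest-first) selection; a sketch, not the paper's code.

    Returns (sampled, unsampled) index arrays. Cosine distance and the fixed
    starting point `seed_index` are illustrative assumptions.
    """
    n = len(embeddings)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")

    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    selected = [seed_index]
    # Cosine distance from every point to its nearest selected center.
    min_dist = 1.0 - unit @ unit[seed_index]
    min_dist[seed_index] = -np.inf  # never pick the same point twice

    while len(selected) < k:
        # Pick the point that is farthest from all current centers.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - unit @ unit[nxt])
        min_dist[nxt] = -np.inf

    sampled = np.array(selected)
    unsampled = np.setdiff1d(np.arange(n), sampled)
    return sampled, unsampled


# Toy 2-D "embeddings": points 0 and 1 are near-duplicates, 2 and 3 lie elsewhere.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
sampled, unsampled = greedy_k_center(toy, k=2)
```

In this toy run the second pick is the point farthest from the starting point, and the near-duplicate of the starting point ends up in the unsampled subset. That is the diversity behaviour the method relies on: the fine-tuned model is later evaluated on the unsampled subset.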

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The overall structure of the DQE method.
  • Figure 2: The identification process for uncovered, difficult, and noisy data.
  • Figure 3: The proportions of uncovered, difficult, and noisy data, as well as their combined proportion in the entire training set.
  • Figure 4: Example of noisy data found by the DQE.
  • Figure 5: Examples of prediction results of DQE on the test set.