Can Active Label Correction Improve LLM-based Modular AI Systems?

Karan Taneja, Ashok Goel

TL;DR

This paper studies the noise in three GPT-3.5-annotated datasets and proposes a novel method ALC3 that iteratively applies three updates to the training dataset: auto-correction, correction using human feedback and filtering.

Abstract

Modular AI systems can be developed using LLM-prompt-based modules to minimize deployment time even for complex tasks. However, these systems do not always perform well, and improving them using the data traces collected from a deployment remains an open challenge. The data traces contain LLM inputs and outputs, but the annotations from LLMs are noisy. We hypothesize that Active Label Correction (ALC) can be used on the collected data to train smaller, improved task-specific models that can replace LLM-based modules. In this paper, we study the noise in three GPT-3.5-annotated datasets and their denoising with human feedback. We also propose a novel method, ALC3, that iteratively applies three updates to the training dataset: auto-correction, correction using human feedback, and filtering. Our results show that ALC3 can lead to oracle performance with feedback on 17-24% fewer examples than the number of noisy examples in the dataset across three different NLP tasks.

Paper Structure (23 sections, 6 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Noisy LLM-annotated datasets are collected from deployment of a modular AI system. Active Label Correction (ALC) is used to predict and correct misannotated examples in order to train a replacement model.
  • Figure 2: Proposed process for improving LLM-based modular AI systems using ALC3. The inputs and noisy labels from a zero/few-shot learner-based module are used to obtain a trained model. Model predictions on the noisy training dataset are computed for the next three steps. (i) Auto-correction updates the labels where the model prediction contradicts the original label with very high confidence. (ii) Human annotation is used to verify and update a fixed number of confusing examples. (iii) Filtering removes some of the remaining examples that are deemed noisy based on model predictions. The process is performed iteratively until a stopping condition is met. Only human annotations are retained after each iteration; iteration two is shown with columns 6, 7, 8, and 9 for illustration.
  • Figure 3: A 2D projection of ATIS text embeddings for a subset of 7 classes. GPT-3.5 annotations are indicated by colors while large dots indicate errors. Most misannotated examples lie near cluster boundaries.
  • Figure 4: MP precision and recall for ATIS, CoNLL, and QNLI as $M$ changes. ALC3 results are the same as those of DALC, and both perform worse than ALC because auto-correction improves data quality before MP.
  • Figure 5: Effect of data size on MP precision. MP precision decreases as more examples are flagged and as data size decreases, but we observe diminishing returns as data size increases.
  • ...and 1 more figures
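
The three updates described in the Figure 2 caption can be sketched in code as a single iteration. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `alc3_step`, the thresholds `auto_threshold` and `filter_threshold`, the budget `human_budget`, the `ask_human` callback, and the rule for picking "confusing" examples (lowest model probability for the current label) are all hypothetical choices made for this example. Model training is left out, and the caller is assumed to supply per-example class probabilities from a model trained on the noisy labels.

```python
from typing import Callable, List, Sequence, Set, Tuple


def alc3_step(
    labels: List[int],
    probs: Sequence[Sequence[float]],
    verified: Set[int],
    ask_human: Callable[[int], int],
    auto_threshold: float = 0.95,   # assumed value for "very high confidence"
    human_budget: int = 1,          # assumed fixed number of human queries
    filter_threshold: float = 0.5,  # assumed cut-off for "deemed noisy"
) -> Tuple[List[int], Set[int], List[int], List[int]]:
    """One illustrative iteration over a noisily labeled dataset.

    Returns (base_labels, verified, train_labels, keep):
      base_labels  - labels carried to the next iteration (human fixes only)
      verified     - indices whose labels were checked by a human
      train_labels - labels for retraining (human fixes + auto-corrections)
      keep         - indices that survive filtering for retraining
    """
    n = len(labels)
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    base = list(labels)
    verified = set(verified)

    # (i) Auto-correction: flip labels the model contradicts very confidently.
    train_labels = list(base)
    for i in range(n):
        confident = probs[i][preds[i]] >= auto_threshold
        if i not in verified and preds[i] != base[i] and confident:
            train_labels[i] = preds[i]

    # (ii) Human feedback: query the examples whose current label the model
    # finds least likely (an assumed proxy for "confusing").
    unverified = [i for i in range(n) if i not in verified]
    unverified.sort(key=lambda i: probs[i][train_labels[i]])
    for i in unverified[:human_budget]:
        answer = ask_human(i)
        base[i] = answer          # human annotations persist across iterations
        train_labels[i] = answer
        verified.add(i)

    # (iii) Filtering: drop remaining unverified examples where the model
    # still disagrees with the label with at least moderate confidence.
    keep = [
        i for i in range(n)
        if i in verified
        or preds[i] == train_labels[i]
        or probs[i][preds[i]] < filter_threshold
    ]
    return base, verified, train_labels, keep
```

In this sketch, only `base_labels` and `verified` would be passed to the next iteration, mirroring the caption's note that only human annotations are retained, while `train_labels` and `keep` define the dataset used to retrain the model before the next round.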