On the Noise Robustness of In-Context Learning for Text Generation

Hongfu Gao, Feipeng Zhang, Wenyu Jiang, Jun Shu, Feng Zheng, Hongxin Wei

TL;DR

This work shows that, on text generation tasks, noisy annotations significantly hurt the performance of in-context learning, and proposes a simple and effective approach called Local Perplexity Ranking (LPR), which replaces the noisy candidates with their nearest neighbors that are more likely to be clean.

Abstract

Large language models (LLMs) have shown impressive performance on downstream tasks through in-context learning (ICL), which relies heavily on the quality of demonstrations selected from a large set of annotated examples. Recent works claim that in-context learning is robust to noisy demonstrations in text classification. In this work, we show that, on text generation tasks, noisy annotations significantly hurt the performance of in-context learning. To circumvent the issue, we propose a simple and effective approach called Local Perplexity Ranking (LPR), which replaces the "noisy" candidates with their nearest neighbors that are more likely to be clean. Our method is motivated by analyzing the perplexity deviation caused by noisy labels and decomposing perplexity into inherent perplexity and matching perplexity. The key idea behind LPR is thus to decouple the matching perplexity by performing the ranking among the neighbors in semantic space. Our approach can prevent the selected demonstrations from including mismatched input-label pairs while preserving the effectiveness of the original selection methods. Extensive experiments demonstrate the effectiveness of LPR, improving the EM score by up to 18.75 on common benchmarks with noisy annotations. Our code is available at https://github.com/ml-stat-Sustech/Local-Perplexity-Ranking.
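
The abstract describes LPR only at a high level: rank each candidate's perplexity among its neighbors in semantic space, and swap candidates that look noisy for nearby ones that look clean. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation (see the repository linked above for that). It assumes per-example perplexities and embeddings are already computed, uses cosine similarity for the neighbor search, and treats a candidate as "noisy" when the fraction of its `k` neighbors with lower perplexity exceeds a threshold `tau`. The function names `local_rank` and `lpr_select`, the toy data, and the exact replacement rule are all hypothetical.

```python
import numpy as np


def local_rank(ppl, emb, k):
    """For each example, return the fraction of its k nearest neighbors
    (by cosine similarity) that have LOWER perplexity, plus the neighbor ids."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)           # an example is not its own neighbor
    nbrs = np.argsort(-sim, axis=1)[:, :k]   # most similar first
    ranks = (ppl[nbrs] < ppl[:, None]).mean(axis=1)
    return ranks, nbrs


def lpr_select(selected, ppl, emb, k=3, tau=0.5):
    """Replace each selected index whose local rank exceeds tau with its
    nearest neighbor whose local rank is at most tau (assumed rule)."""
    ranks, nbrs = local_rank(ppl, emb, k)
    result = []
    for i in selected:
        if ranks[i] <= tau:                  # looks clean: keep as selected
            result.append(int(i))
            continue
        clean = [int(j) for j in nbrs[i] if ranks[j] <= tau]
        # Fall back to the original candidate if no neighbor looks clean.
        result.append(clean[0] if clean else int(i))
    return result


# Toy data (hypothetical): two semantic clusters of four examples each.
emb = np.array([
    [1.00, 0.00], [1.00, 0.05], [1.00, 0.10], [1.00, 0.15],
    [0.00, 1.00], [0.05, 1.00], [0.10, 1.00], [0.15, 1.00],
])
# Examples 2 and 5 have much higher perplexity than their neighbors.
ppl = np.array([5.0, 5.5, 50.0, 6.0, 7.0, 80.0, 7.5, 8.0])

picked = lpr_select([0, 2, 5], ppl, emb, k=3, tau=0.5)
print(picked)
```

Because the ranking is computed only among semantic neighbors, an example is judged against inputs of similar difficulty rather than against the whole candidate pool, which is one way to read the abstract's "decoupling" of matching perplexity from inherent perplexity. Clean candidates pass through unchanged, so the output of the original selection method is preserved wherever no noise is suspected.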

Paper Structure (26 sections, 10 equations, 7 figures, 16 tables)

Figures (7)

  • Figure 1: Average ICL performance with noisy annotations in various generation tasks across different demonstration settings. Both types of noise significantly degrade the performance of in-context learning on text generation tasks. The black line denotes zero-shot performance.
  • Figure 2: The distribution of perplexity of Llama2-7B (Touvron et al., 2023) on clean and noisy annotations. Examples with noisy annotations indeed obtain higher perplexity than those with clean annotations.
  • Figure 3: The average test performance with different thresholds $\tau$ and numbers of local neighbors $k$ across various noise types. Panels (a) and (b) show how the hyperparameter $\tau$ affects the performance of LPR. Panels (c) and (d) illustrate the influence of the hyperparameter $k$.
  • Figure 4: Average test accuracy on SST2 (Socher et al., 2013) and AGNews (Zhang et al., 2015). Different colors indicate the selection methods. The solid lines denote existing selection methods, and the dotted lines represent the same methods integrated with our method. We omit the noise type for the binary classification task, SST2.
  • Figure 5: Average results of ICL with noisy annotations in various generation tasks across different demonstration settings. Both types of noise significantly degrade the performance of in-context learning on code generation tasks. The black line denotes zero-shot performance.
  • ...and 2 more figures