How much reliable is ChatGPT's prediction on Information Extraction under Input Perturbations?

Ishani Mondal; Abhilasha Sancheti

TL;DR

This work investigates how ChatGPT's Named Entity Recognition predictions respond to input perturbations, systematically perturbing entity tokens and their contexts. It combines automatic and human evaluations on CONLL-2003 and BC5CDR to analyze accuracy, explanation faithfulness, and confidence calibration under zero-shot and in-context learning settings. Key findings show that robustness is highly dependent on perturbation type and entity domain, with explanations shifting between global and local cues, while in-context learning mitigates overconfidence and enhances explanation quality. The study highlights practical implications for deploying LLMs in information extraction, and identifies limitations around factual grounding of global explanations and sentence integrity under perturbations. The methodological framework and perturbation taxonomy provide a template for evaluating reliability in IE tasks involving LLMs.

Abstract

In this paper, we assess the robustness (reliability) of ChatGPT under input perturbations for one of the most fundamental tasks of Information Extraction (IE), i.e., Named Entity Recognition (NER). Despite the hype, and although most researchers have vouched for its language understanding and generation capabilities, little attention has been paid to its robustness: how input perturbations affect 1) the predictions, 2) the confidence of the predictions, and 3) the quality of the rationale behind them. We perform a systematic analysis of ChatGPT's robustness (under both zero-shot and few-shot setups) on two NER datasets using both automatic and human evaluation. Based on automatic evaluation metrics, we find that 1) ChatGPT is more brittle on Drug or Disease replacements (rare entities) than on perturbations of widely known Person or Location entities, 2) the quality of explanations for the same entity differs considerably under different types of "Entity-Specific" and "Context-Specific" perturbations, and this quality can be significantly improved using in-context learning, and 3) it is overconfident for the majority of its incorrect predictions and could therefore misguide end users.
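The two ideas the abstract relies on, an entity-specific perturbation and overconfidence on incorrect predictions, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Prediction` record, the function names, the 0.8 confidence threshold, and the toy sentence are all assumptions made for illustration. Only the synonym pair and the 90% and 80% confidence values are taken from the Figure 1 caption.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One NER prediction for a target entity (hypothetical record layout)."""
    correct: bool      # did the predicted entity match the gold label?
    confidence: float  # model-reported confidence, scaled to [0, 1]


def replace_entity(tokens, start, end, substitute):
    """Entity-specific perturbation: swap tokens[start:end] for a substitute."""
    return tokens[:start] + substitute + tokens[end:]


def overconfident_error_rate(predictions, threshold=0.8):
    """Fraction of incorrect predictions with confidence >= threshold.

    The threshold is an assumed cut-off, not a value from the paper.
    """
    errors = [p for p in predictions if not p.correct]
    if not errors:
        return 0.0
    return sum(p.confidence >= threshold for p in errors) / len(errors)


# Toy sentence written for this sketch (not an actual BC5CDR example).
tokens = "The patient developed orthostatic hypotension after treatment".split()

# Replace the two-token disease mention with the synonym named in Figure 1.
perturbed = replace_entity(tokens, 3, 5, ["orthostatis"])

# Confidence values mirror the Figure 1 caption: correct at 90% before the
# perturbation, incorrect at 80% after it.
before = Prediction(correct=True, confidence=0.90)
after = Prediction(correct=False, confidence=0.80)

print(" ".join(perturbed))
print(overconfident_error_rate([before, after]))
```

Under these assumptions, a high rate means the model's stated confidence gives little warning when a perturbation flips a prediction from correct to incorrect.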

Paper Structure (39 sections, 3 figures, 7 tables)

Introduction
Can we automatically generate input perturbations?
B. Context-Specific:
Experimental Setup
Datasets:
Evaluation Criteria:
1. Performance Difference under Perturbation:
2. Difference in Quality of Explanations due to perturbation:
3. Confidence Calibration under Perturbation:
Zero-shot and Few-shot Setup:
How are the prompts designed?
Implementation Details
How to estimate Reliability?
Automatic Evaluation
Is there any effect of ChatGPT's NER prediction on the target entity?
...and 24 more sections

Figures (3)

Figure 1: An example sentence from BC5CDR in which the disease entity orthostatic hypotension has been perturbed with the synonym orthostatis. Before perturbation, the disease was correctly predicted and explained with high confidence (90%). After perturbation, degree was incorrectly predicted as a disease entity with a wrong explanation. However, ChatGPT is nearly as confident (80%) as it was when it made the correct prediction.
Figure 2: Percentage of examples before and after perturbation for which the explanations are less informative (such as "refers to a country/person" or "it is a chemical compound/substance") for the BC5CDR and CONLL datasets.
Figure 3: Percentage of (input, perturbed input) pairs with a change in the type of explanation for (i) target and (ii) non-target entities in BC5CDR.
