STAYKATE: Hybrid In-Context Example Selection Combining Representativeness Sampling and Retrieval-based Approach -- A Case Study on Science Domains

Chencheng Zhu, Kazutaka Shimada, Tomoki Taniguchi, Tomoko Ohkuma

TL;DR

Addresses the sensitivity of in-context learning to the choice of demonstrations in scientific NER under low-resource settings. Proposes STAYKATE, a static-dynamic hybrid method that combines representativeness sampling for static examples with KATE (kNN-Augmented in-conText Example selection) retrieval for dynamic examples, used within a four-part GPT-3.5 prompt (system role, task instructions, in-context examples, and test input). Demonstrates on MSPT, WLP, and BC5CDR that STAYKATE outperforms fine-tuned BERT and existing selection methods, with the largest gains on domain-specific entity types. Finds that STAYKATE reduces overprediction and improves disambiguation, making scientific information extraction more robust in low-resource scenarios.

Abstract

Large language models (LLMs) demonstrate the ability to learn in-context, offering a potential solution for scientific information extraction, which often contends with challenges such as insufficient training data and the high cost of annotation. Because the selection of in-context examples can significantly affect performance, it is crucial to design a sound method for sampling effective ones. In this paper, we propose STAYKATE, a static-dynamic hybrid selection method that combines the principles of representativeness sampling from active learning with the prevalent retrieval-based approach. Results across three domain-specific datasets indicate that STAYKATE outperforms both traditional supervised methods and existing selection methods. The performance gain is particularly pronounced for entity types that pose challenges for other methods.
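The static-dynamic idea can be illustrated with a minimal sketch. This is not the authors' implementation: the representativeness score used here (mean cosine similarity to the rest of the pool), the function names, and the choice to place static examples before dynamic ones are all assumptions made for illustration. The dynamic step follows the general retrieval-based recipe of picking the pool examples nearest to the test input in embedding space.

```python
import numpy as np


def _unit_rows(x):
    """L2-normalize each row so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def select_static(pool_emb, k):
    """Pick k 'representative' pool items, fixed for every test input.

    Assumption: representativeness is approximated by the mean cosine
    similarity of an item to all other pool items.
    """
    unit = _unit_rows(np.asarray(pool_emb, dtype=float))
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity
    score = sim.sum(axis=1) / max(len(unit) - 1, 1)
    return [int(i) for i in np.argsort(-score)[:k]]


def select_dynamic(pool_emb, query_emb, k, exclude=()):
    """Retrieve the k pool items nearest to the test input (kNN by cosine)."""
    unit = _unit_rows(np.asarray(pool_emb, dtype=float))
    q = np.asarray(query_emb, dtype=float)
    q = q / max(np.linalg.norm(q), 1e-12)
    sim = unit @ q
    for i in exclude:  # do not repeat examples already chosen as static
        sim[i] = -np.inf
    return [int(i) for i in np.argsort(-sim)[:k]]


def build_prompt(system_role, instructions, examples, test_input):
    """Assemble the four parts: system role, task instructions,
    in-context examples, and test input (layout is hypothetical)."""
    shots = "\n\n".join(f"Input: {s}\nOutput: {a}" for s, a in examples)
    return f"{system_role}\n\n{instructions}\n\n{shots}\n\nInput: {test_input}\nOutput:"
```

In use, `select_static` would be run once over the annotated pool, while `select_dynamic` would be run per test sentence; the two index lists are then concatenated and passed to `build_prompt` as (sentence, annotation) pairs.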

Paper Structure

This paper contains 34 sections, 3 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: The overall process of STAYKATE. The right side of the figure shows the entire prompt structure.
  • Figure 2: Our experimental setting for the data pool.
  • Figure 3: The distribution of predictive entropy for each dataset.
  • Figure 4: Statistics on the percentage of various error types.
  • Figure 5: Statistics of errors across different selection methods for MSPT.
  • ...and 4 more figures