Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

Marcio Fonseca; Shay B. Cohen

TL;DR

The paper investigates whether instruction-tuned large language models can learn and apply in-context concept definitions to classify sentences, using factual, empty, out-of-dictionary, and counterfactual guidelines across scientific and financial domains. It presents a unified framework linking concept definitions, prompts, and prior knowledge to label inference, and evaluates a spectrum of models from open-source Llama-2 variants to Falcon-180B and proprietary GPT-3.5/4. The findings show that concept definitions reliably boost performance, but robust counterfactual understanding emerges mainly in the largest proprietary models, with open-source models often failing under nonsensical or highly counterfactual prompts. The work highlights important gaps between leading open-source and proprietary systems in concept grounding and suggests that alignment strategies, not just scale, drive the ability to handle challenging, counterfactual scenarios.

Abstract

Although large language models (LLMs) exhibit remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new concepts or facts from ground-truth labels. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept guidelines for sentence labeling tasks. We design guidelines that present different types of factual and counterfactual concept definitions, which are used as prompts for zero-shot sentence classification tasks. Our results show that although concept definitions consistently help in task performance, only the larger models (with 70B parameters or more) show even a limited ability to work under counterfactual contexts. Importantly, only proprietary models such as GPT-3.5 and GPT-4 can recognize nonsensical guidelines, which we hypothesize is due to more sophisticated alignment methods. Finally, we find that Falcon-180B-chat is outperformed by Llama-2-70B-chat in most cases, which indicates that careful fine-tuning is more effective than increasing model scale. Altogether, our simple evaluation method reveals significant gaps in concept understanding between the most capable open-source language models and the leading proprietary APIs.
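The guideline interventions described in the abstract can be illustrated with a small sketch. It is not the authors' code. The concept labels and definitions in EXAMPLE_DEFS, the prompt template in render_prompt, and the way out-of-dictionary labels are generated (random lowercase strings) are all assumptions made for illustration. The sketch only shows how one fixed set of definitions could be turned into factual, empty-definition, out-of-dictionary, and fully counterfactual guidelines while the task prompt stays the same.

```python
import random

# Hypothetical concepts and definitions, for illustration only.
EXAMPLE_DEFS = {
    "Background": "States prior work or context for the study.",
    "Method": "Describes how the study was carried out.",
    "Result": "Reports an outcome or finding of the study.",
}

def derange(items, rng):
    """Return a shuffled copy of `items` with no element in its original position."""
    if len(items) < 2:
        raise ValueError("need at least two items to derange")
    shuffled = list(items)
    # The copy starts identical to `items`, so the loop runs at least once.
    while any(a == b for a, b in zip(items, shuffled)):
        rng.shuffle(shuffled)
    return shuffled

def factual(defs):
    """Factual guideline: every label keeps its own definition."""
    return dict(defs)

def empty_definition(defs):
    """Empty-definition guideline: labels only, no descriptions."""
    return {label: "" for label in defs}

def out_of_dictionary(defs, rng, length=6):
    """Assumed reading: replace each label with a made-up string, keep the definitions."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = {}
    for definition in defs.values():
        label = "".join(rng.choice(alphabet) for _ in range(length))
        while label in out or label in defs:
            label = "".join(rng.choice(alphabet) for _ in range(length))
        out[label] = definition
    return out

def counterfactual(defs, rng):
    """Fully counterfactual guideline: every label receives another label's definition."""
    labels = list(defs)
    sources = derange(labels, rng)
    return {label: defs[src] for label, src in zip(labels, sources)}

def render_prompt(guideline, sentence):
    """Hypothetical prompt template; only the guideline part changes between conditions."""
    lines = ["Concept definitions:"]
    for label, definition in guideline.items():
        lines.append(f"- {label}: {definition}" if definition else f"- {label}")
    lines.append(f"Sentence: {sentence}")
    lines.append("Label:")
    return "\n".join(lines)

if __name__ == "__main__":
    rng = random.Random(0)
    print(render_prompt(counterfactual(EXAMPLE_DEFS, rng), "We train a linear probe."))
```

Under a counterfactual guideline, a model that follows the definitions should predict the label whose swapped definition matches the sentence, not the label it would choose from prior knowledge.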

Paper Structure (37 sections, 3 equations, 8 figures, 7 tables)

Introduction
Concept Classification with Guidelines
Concept Guidelines
Factual guidelines $G_f$
Out-of-dictionary guidelines $G_{\text{OOD}}$
Empty-definition guidelines $G_\varepsilon$
Counterfactual guidelines $G_c$
Concept Definitions
Experimental Setup
Scientific Concepts Dataset
Financial Concepts Dataset
Data Collection
Annotation Scheme
Annotation Process
Hiring and training
...and 22 more sections

Figures (8)

Figure 1: An abridged example of zero-shot sentence classification using a concept guideline prompt. We perform controlled interventions in concept definitions (pairs of concept labels $c_K$ and their descriptions $\delta(c_K)$) while keeping the task prompt fixed. We aim to gauge the capacity of the model to learn new concepts during inference, without in-context demonstrations.
Figure 2: Concept classification accuracy for different scientific (top) and financial (bottom) concept guidelines. In this experiment, the counterfactual guideline $G_c$ is a random permutation where all concept definitions are counterfactual. Empty-Def refers to the empty-definition factual ($G_{f,\varepsilon}$) and out-of-vocabulary guidelines ($G_{OOD,\varepsilon}$). Error bars represent the 95% confidence interval and the dashed line indicates the random classifier baseline.
Figure 3: Concept classification accuracy results for different levels of counterfactuality of scientific (left) and financial (right) concept guidelines. We sample 10 guidelines for each counterfactuality level and average the classification accuracies. Error bars represent the standard deviations.
Figure 4: Guideline adherence scores per financial and scientific concept for GPT-3.5. Each cell $A_{ij}$ shows the fraction of concept predictions that adhere to concept definitions $\delta(c_j) = d_i$, where the rows indicate original factual labels $c_i$ that are randomly replaced by labels $c_j$ (columns). Off-diagonal results indicate counterfactual definitions.
Figure 5: Concept classification accuracy for different financial concept guidelines, using the same definitions provided to human labelers (Figure \ref{fig:financial_annotation_guidelines}). In this experiment, the counterfactual guideline $G_c$ is a random permutation where all concept definitions are counterfactual. Empty-Def refers to the empty-definition factual ($G_{f,\varepsilon}$) and out-of-vocabulary guidelines ($G_{OOD,\varepsilon}$). Error bars represent the 95% confidence interval and the dashed line indicates the random classifier baseline.
...and 3 more figures
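The caption of Figure 3 mentions sampling 10 guidelines per counterfactuality level and reporting the mean accuracy with standard deviations. The sketch below shows one way such a protocol could be set up. It assumes that the counterfactuality level is the number of concept labels whose definitions are swapped, which the captions above do not state. The function names are hypothetical, and the choice of the sample (not population) standard deviation is also an assumption.

```python
import random
import statistics

def partial_counterfactual(defs, k, rng):
    """Return a copy of `defs` in which exactly k labels receive another label's definition.

    Assumes all definitions are distinct. A level of 1 is impossible, because a
    single definition cannot be swapped with itself.
    """
    labels = list(defs)
    if k == 0:
        return dict(defs)
    if k == 1 or k > len(labels):
        raise ValueError("k must be 0 or between 2 and the number of concepts")
    chosen = rng.sample(labels, k)
    shuffled = list(chosen)
    # Reshuffle until none of the chosen labels keeps its own definition.
    while any(a == b for a, b in zip(chosen, shuffled)):
        rng.shuffle(shuffled)
    out = dict(defs)
    for target, source in zip(chosen, shuffled):
        out[target] = defs[source]
    return out

def sample_guidelines(defs, k, n_samples, seed=0):
    """Draw n_samples guidelines at counterfactuality level k."""
    rng = random.Random(seed)
    return [partial_counterfactual(defs, k, rng) for _ in range(n_samples)]

def summarize(accuracies):
    """Mean and sample standard deviation of per-guideline accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Each sampled guideline would be rendered into a prompt and scored separately, and summarize would then be applied to the resulting list of accuracies for that level.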
