Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains
Marcio Fonseca, Shay B. Cohen
TL;DR
The paper investigates whether instruction-tuned large language models can learn and apply in-context concept definitions to classify sentences, using factual, empty, out-of-dictionary, and counterfactual guidelines across scientific and financial domains. It presents a unified framework linking concept definitions, prompts, and prior knowledge to label inference, and evaluates a spectrum of models from open-source Llama-2 variants to Falcon-180B and proprietary GPT-3.5/4. The findings show that concept definitions consistently improve performance, but only models with 70B or more parameters show even a limited ability to work with counterfactual definitions, and only the proprietary GPT-3.5 and GPT-4 recognize nonsensical guidelines. The work highlights significant gaps in concept understanding between leading open-source and proprietary systems and suggests that alignment and careful fine-tuning, not just scale, drive the ability to handle counterfactual and nonsensical guidelines.
Abstract
Although large language models (LLMs) exhibit remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new concepts or facts from ground-truth labels. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept guidelines for sentence labeling tasks. We design guidelines that present different types of factual and counterfactual concept definitions, which are used as prompts for zero-shot sentence classification tasks. Our results show that although concept definitions consistently help in task performance, only the larger models (with 70B parameters or more) have limited ability to work under counterfactual contexts. Importantly, only proprietary models such as GPT-3.5 and GPT-4 can recognize nonsensical guidelines, which we hypothesize is due to more sophisticated alignment methods. Finally, we find that Falcon-180B-chat is outperformed by Llama-2-70B-chat in most cases, which indicates that careful fine-tuning is more effective than increasing model scale. Altogether, our simple evaluation method reveals significant gaps in concept understanding between the most capable open-source language models and the leading proprietary APIs.
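To make the evaluation setup concrete, the following is a minimal sketch of how a guideline could be turned into a zero-shot classification prompt, and how "empty" and "counterfactual" variants might be derived from a factual one. It is an illustration only, not the authors' implementation: the prompt wording, the label names (Method, Result), the example definitions, and the choice to build counterfactuals by rotating definitions across labels are all assumptions.

```python
from typing import Dict

# Hypothetical factual guideline: label -> concept definition (illustrative only).
FACTUAL: Dict[str, str] = {
    "Method": "Describes how the study was carried out.",
    "Result": "Reports a finding of the study.",
}


def build_prompt(sentence: str, guidelines: Dict[str, str]) -> str:
    """Assemble a zero-shot prompt from label -> definition pairs."""
    lines = ["Classify the sentence into one of the labels below."]
    for label, definition in guidelines.items():
        # A blank definition produces a label-only line (the "empty" condition).
        lines.append(f"- {label}: {definition}" if definition else f"- {label}")
    lines.append(f"Sentence: {sentence}")
    lines.append("Label:")
    return "\n".join(lines)


def empty_variant(guidelines: Dict[str, str]) -> Dict[str, str]:
    """Keep the label names but drop every definition."""
    return {label: "" for label in guidelines}


def counterfactual_variant(guidelines: Dict[str, str]) -> Dict[str, str]:
    """Assumed construction: shift definitions by one so each label
    receives the definition that belongs to a different label."""
    labels = list(guidelines)
    definitions = list(guidelines.values())
    rotated = definitions[1:] + definitions[:1]
    return dict(zip(labels, rotated))


if __name__ == "__main__":
    sentence = "We fine-tune the model for three epochs."
    print(build_prompt(sentence, FACTUAL))
    print()
    print(build_prompt(sentence, counterfactual_variant(FACTUAL)))
```

Under this reading, a model that truly follows the guideline should change its prediction when given the counterfactual prompt, whereas a model that relies on prior knowledge of the label names would answer the same way in both conditions.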
