SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models

Bardiya Akhbari, Manish Gawali, Nicholas A. Dronen

TL;DR

We show that rigorously measuring language model robustness to variation in frequency and length is challenging, and we present an analysis that measures the two independently.

Abstract

Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically-oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets are comprised of arbitrary symbols (e.g. numbers, words), they provide an opportunity to test, systematically, the invariance of LLMs' algorithmic abilities under simple lexical or semantic variations. To this end, we present the SetLexSem Challenge, a synthetic benchmark that evaluates the performance of LLMs on set operations. SetLexSem assesses the robustness of LLMs' instruction-following abilities under various conditions, focusing on the set operations and the nature and construction of the set members. Evaluating seven LLMs with SetLexSem, we find that they exhibit poor robustness to variation in both operation and operands. We show -- via the framework's systematic sampling of set members along lexical and semantic dimensions -- that LLMs are not only not robust to variation along these dimensions but demonstrate unique failure modes in particular, easy-to-create semantic groupings of "deceptive" sets. We find that rigorously measuring language model robustness to variation in frequency and length is challenging and present an analysis that measures them independently. The code for reproducing the results of this paper, and for generating the SetLexSem Challenge dataset, is available at https://github.com/amazon-science/SetLexSem-Challenge.

Paper Structure

This paper contains 20 sections, 9 figures, 19 tables.

Figures (9)

  • Figure 1: To evaluate the robustness of LLMs to semantic variation in set members, we create "deceptive" sets. To construct such sets, we sample a pair of hypernyms (e.g. "mammal" and "vehicle") and, from them, a set of their hyponyms in three conditions: (1) with the hyponyms as sampled, (2) with half of the set members swapped, and (3) randomly sampled. LLMs exhibit a unique failure mode under the second condition (swapped), and the mean and variance in accuracy of the first condition (not swapped) are better than those of the random baseline. See \ref{fig:related-words-results} for results.
  • Figure 2: Example of our baseline prompt with sets of size two. Every prompt follows this template: set construction, task definition, demonstrations, and final instructions. Note that the baseline prompt instructs the LLM not to explain its reasoning whereas the chain-of-thought prompt instructs the model to think step by step. In this example, the set members are numbers and each token in a set is two characters long. The prompt explicitly instructs the model not to use external tools, should they be available to it. Additional examples of prompts are provided in the Appendix.
  • Figure 3: Aggregate accuracy of LLMs on SetLexSem. Each distribution consists of 12,400 prompts. This is 400 fewer than the 12,800 that should be expected given (1) that we did not do $k$-shot prompting in these runs and (2) the number of other hyperparameters in Table \ref{table:prompt-hyperparameters}. The discrepancy is due to the case where token length is 1, which has fewer prompts due to sampling with replacement.
  • Figure 4: LLM accuracy on set operations varies (a) by operation and (b) by operand size. Each violin plot shows a distribution of accuracy. Each point in the distribution is the fraction of correct responses out of 50 samples of different sets while holding a prompt (and its hyperparameters) constant. See Table \ref{table:prompt-hyperparameters} for hyperparameters.
  • Figure 5: LLM accuracy on set operations appears to exhibit some bias in favor of words over numbers, but this result is inconclusive.
  • ...and 4 more figures
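
The three conditions described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy HYPONYMS table, the function name make_set_pair, and the condition labels are hypothetical, and the random condition is assumed here to draw members from the pooled vocabulary without regard to hypernym.

```python
import random

# Hypothetical toy taxonomy for illustration only; the benchmark's actual
# source of hypernyms and hyponyms is not described in this section.
HYPONYMS = {
    "mammal": ["dog", "cat", "horse", "whale", "otter", "lemur"],
    "vehicle": ["car", "truck", "bicycle", "tram", "canoe", "scooter"],
}


def make_set_pair(hypernym_a, hypernym_b, size, condition, rng):
    """Return two sets of `size` members built under one of three conditions."""
    if condition == "random":
        # Assumed baseline: ignore the hypernyms and sample from the whole pool.
        pool = sorted(word for words in HYPONYMS.values() for word in words)
        drawn = rng.sample(pool, 2 * size)
        return set(drawn[:size]), set(drawn[size:])

    # Sample distinct hyponyms of each hypernym.
    a = rng.sample(HYPONYMS[hypernym_a], size)
    b = rng.sample(HYPONYMS[hypernym_b], size)

    if condition == "not_swapped":
        # Condition (1): each set is semantically coherent.
        return set(a), set(b)
    if condition == "swapped":
        # Condition (2): exchange half of the members, so each set mixes
        # both semantic groups (the "deceptive" case).
        half = size // 2
        return set(b[:half] + a[half:]), set(a[:half] + b[half:])
    raise ValueError(f"unknown condition: {condition}")


if __name__ == "__main__":
    rng = random.Random(0)
    for condition in ("not_swapped", "swapped", "random"):
        set_a, set_b = make_set_pair("mammal", "vehicle", 4, condition, rng)
        print(condition, sorted(set_a), sorted(set_b))
```

Because the operands are ordinary sets, a reference answer for any prompt built this way can be computed directly with Python's set operators (for example, set_a | set_b for a union) and compared against the model's response.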