Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models

Yang Liu, Melissa Xiaohui Qin, Hongming Li, Chao Huang

TL;DR

LexBench offers the first large-scale, unified benchmark suite for semantic phrase processing across ten tasks spanning idioms, noun compounds, verbal constructions, and lexical collocations. By evaluating fifteen models, including GPT-4, Claude-3, and various open- and closed-access systems, it analyzes performance in classification, extraction, and interpretation under zero-shot and few-shot prompting, and introduces Oracle Prompting to improve extraction. Key findings show scaling laws hold for many tasks, few-shot LMs lag behind fine-tuned models in some extraction tasks, and human performance is matched or exceeded in several categories, though gaps remain, especially in extraction. The results provide guidance for future methods aimed at advancing LM grounding and semantic phrase comprehension, and highlight the potential of prompting strategies to raise performance, while also exposing limitations in multilingual coverage and long-tail phrase phenomena.

Abstract

We introduce LexBench, a comprehensive evaluation suite designed to test language models (LMs) on ten semantic phrase processing tasks. Unlike prior studies, it is the first work to propose a framework from a comparative perspective for modeling the general semantic phrase (i.e., lexical collocation) and three fine-grained semantic phrases: idiomatic expression, noun compound, and verbal construction. Using LexBench, we assess the performance of 15 LMs across model architectures and parameter scales in classification, extraction, and interpretation tasks. Through the experiments, we first validate the scaling law and find that, as expected, larger models outperform smaller ones in most tasks. Second, we investigate further by scaling semantic relation categorization and find that few-shot LMs still lag behind vanilla fine-tuned models on this task. Third, through human evaluation, we find that the performance of strong models is comparable to the human level in semantic phrase processing. Our benchmarking findings can serve future research aiming to improve the generic capability of LMs on semantic phrase comprehension. Our source code and data are available at https://github.com/jacklanda/LexBench

Paper Structure (83 sections, 15 figures, 10 tables)

Figures (15)

  • Figure 1: The overall best performance (i.e., capacity triangle $\triangle$) of models on LexBench.
  • Figure 2: We manually curated the works related to semantic phrase processing in NLP published from 2010 to the present in Google Scholar and presented them in descending order (from left to right).
  • Figure 6: Semantic relation categorization ability on $\mathcal{LC}$ with different numbers of in-context exemplars and semantic category scales. The number $n$ of classes is chosen from $N := \{1, 2, 4, 8, 16\}$. Each model is prompted in the $k$-shot setting, where $k \in \{0, 3, 5\}$. Accuracy is the mean over 30 examples sampled per class from the test split of Espinosa-Anke et al. (2021). Partial category sets ($n \le 8$) are run with three randomly selected class combinations, and the reported value is their mean.
  • Figure 7: A data example of idiomaticity detection (IED).
  • Figure 8: A data example of idiom extraction (IEE).
  • ...and 10 more figures
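
The sampling protocol described in the Figure 6 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `sample_eval_sets` and `mean_accuracy`, the `predict` callback, and the choice to use a single run when all classes are selected are assumptions introduced here for clarity.

```python
import random
from statistics import mean


def sample_eval_sets(examples_by_class, n, per_class=30, n_combos=3, seed=0):
    """Build evaluation sets for an n-class categorization run.

    examples_by_class: dict mapping a class label to a list of test examples.
    Hypothetical helper: draws `n_combos` random class combinations of size n
    and samples `per_class` examples from each chosen class.
    """
    rng = random.Random(seed)
    classes = sorted(examples_by_class)
    # Assumption: when every class is selected there is only one combination.
    combos = 1 if n >= len(classes) else n_combos
    eval_sets = []
    for _ in range(combos):
        chosen = rng.sample(classes, n)
        subset = []
        for label in chosen:
            pool = examples_by_class[label]
            picked = rng.sample(pool, min(per_class, len(pool)))
            subset.extend((text, label) for text in picked)
        eval_sets.append(subset)
    return eval_sets


def mean_accuracy(eval_sets, predict):
    """Average the accuracy of predict(text, candidate_labels) over all sets."""
    scores = []
    for subset in eval_sets:
        labels = sorted({label for _, label in subset})
        correct = sum(predict(text, labels) == label for text, label in subset)
        scores.append(correct / len(subset))
    return mean(scores)
```

In practice `predict` would wrap a $k$-shot prompt to the model under evaluation; here it is left as a plain callable so the sampling and averaging logic can be checked in isolation.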