ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution

Xuanming Zhang, Zixun Chen, Zhou Yu

TL;DR

ProLex addresses a gap in lexical substitution research: existing work does not consider whether a substitute matches or exceeds the proficiency level of the target word. The benchmark is built with a pipeline that selects word-sentence pairs from the TOEFL-11 learner corpus, corrects basic grammar errors, generates candidate substitutes with GPT-4, collects human judgments of appropriateness, and applies CEFR-based filtering, so that the final substitutes are both contextually appropriate and at least as proficient as the target word. The paper's best model, a Llama2-13B model fine-tuned on task-specific synthetic data, outperforms ChatGPT by an average of 3.2% in F-score and achieves results comparable to GPT-4, which supports the practicality of proficiency-oriented substitution for language learners. By releasing ProLex and its methodology, the paper provides a resource for evaluating and developing systems that directly support vocabulary development and advanced writing for L2 learners.

Abstract

Lexical Substitution discovers appropriate substitutes for a given target word in a context sentence. However, the task fails to consider substitutes that are of equal or higher proficiency than the target, an aspect that could be beneficial for language learners looking to improve their writing. To bridge this gap, we propose a new task, language proficiency-oriented lexical substitution. We also introduce ProLex, a novel benchmark designed to assess systems' ability to generate not only appropriate substitutes but also substitutes that demonstrate better language proficiency. Besides the benchmark, we propose models that can automatically perform the new task. We show that our best model, a Llama2-13B model fine-tuned with task-specific synthetic data, outperforms ChatGPT by an average of 3.2% in F-score and achieves comparable results with GPT-4 on ProLex.

Paper Structure (46 sections, 1 equation, 4 figures, 14 tables)


Figures (4)

  • Figure 1: The process of creating ProLex. We start by selecting word-sentence $(w, s)$ pairs from TOEFL-11 based on word frequency. Then we use a fine-tuned Grammar Error Correction (GEC) model to correct basic grammar errors in the selected sentences. We use GPT-4 to generate candidate substitutes, each of which is denoted as $w'$. For each $(w, s, w')$ triple, we ask human experts to assess $w'$ for appropriateness. The resulting list of acceptable substitutes is denoted as $w^a$. For all substitutes in $w^a$, we further apply a CEFR Checker (Cathoven AI, 2023) to obtain their proficiency levels, and ultimately remove substitutes that demonstrate lower-level proficiency than the target word. This produces our final quadruplets in ProLex, namely $(w, s, w^a, w^a_p)$.
  • Figure 2: Distribution of CEFR levels of target words in low-level and medium-level sentences in ProLex.
  • Figure 3: Distribution of CEFR levels for substitutes in $w^a$ and substitutes in $w^a_p$ in ProLex. In low-level sentences, more than 65% of the proficiency-oriented substitutes are from B1 level or higher; similarly, in medium-level sentences, over 75% of these substitutes are sourced from B1 level or above.
  • Figure 4: Distribution of CEFR levels for target words and proficiency-oriented substitutions in $D_S$.
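
The final filtering step described in the Figure 1 caption, which reduces the acceptable substitutes $w^a$ to the proficiency-oriented substitutes $w^a_p$, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_by_proficiency`, the `cefr_of` lookup table, and the toy CEFR levels are all assumptions made for the example, and a real pipeline would obtain levels from the CEFR Checker rather than a hand-written dictionary.

```python
# Hypothetical sketch of the last step in Figure 1: keep only the acceptable
# substitutes whose CEFR level is not lower than that of the target word.

# Standard CEFR levels, ordered from lowest to highest proficiency.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]
CEFR_RANK = {level: rank for rank, level in enumerate(CEFR_ORDER)}


def filter_by_proficiency(target, acceptable, cefr_of):
    """Return the substitutes in `acceptable` (w^a) that form w^a_p.

    `cefr_of` maps a word to its CEFR level; it stands in for the output
    of a CEFR checker. A word missing from `cefr_of` raises KeyError.
    """
    target_rank = CEFR_RANK[cefr_of[target]]
    return [w for w in acceptable if CEFR_RANK[cefr_of[w]] >= target_rank]


# Toy levels invented for illustration only; they are not taken from the
# paper or from any CEFR checker.
toy_levels = {"big": "A1", "large": "A2", "huge": "B1", "enormous": "B2"}

# "big" is dropped because its toy level (A1) is below the target's (A2).
w_a_p = filter_by_proficiency("large", ["big", "huge", "enormous"], toy_levels)
print(w_a_p)
```

Comparing ranks with `>=` follows the caption's wording: substitutes at the same level as the target are kept, and only lower-level ones are removed.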