ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution
Xuanming Zhang, Zixun Chen, Zhou Yu
TL;DR
ProLex addresses a gap in lexical substitution research, which has not accounted for the language proficiency of substitutes, by introducing a proficiency-oriented substitution task and a dedicated benchmark. Its data pipeline draws on the TOEFL-11 learner corpus, grammar correction, and GPT-4 candidate generation, followed by human judgments and CEFR-based filtering, to yield final substitutes that are both contextually appropriate and at least as proficient as the target word. The work demonstrates that open models instruction-tuned on task-specific synthetic data can outperform zero-shot LLM prompting and approach the performance of GPT-4, supporting the practicality of proficiency-oriented substitution for language learners. By releasing ProLex and its methodology, the paper provides a resource for evaluating and developing systems that directly support vocabulary development and advanced writing for L2 learners.
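As a rough illustration of the CEFR-based filtering step, the sketch below keeps only candidates whose CEFR level is at least that of the target word. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `filter_by_proficiency`, the toy lexicon, and the level assigned to each word are hypothetical, and the choice to drop words missing from the lexicon is an assumption as well.

```python
# CEFR levels ordered from lowest (A1) to highest (C2) proficiency.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]
CEFR_RANK = {level: rank for rank, level in enumerate(CEFR_ORDER)}

# Hypothetical toy lexicon; the levels are illustrative only and are not
# taken from any real CEFR word list.
TOY_LEXICON = {
    "good": "A1",
    "nice": "A1",
    "helpful": "A2",
    "beneficial": "B2",
    "advantageous": "C1",
}


def filter_by_proficiency(target, candidates, lexicon):
    """Keep candidates whose CEFR level is equal to or higher than the target's.

    Assumption: words missing from the lexicon are dropped, and an unknown
    target yields an empty list, because no comparison is possible.
    """
    target_level = lexicon.get(target)
    if target_level is None:
        return []
    target_rank = CEFR_RANK[target_level]

    kept = []
    for word in candidates:
        level = lexicon.get(word)
        if level is not None and CEFR_RANK[level] >= target_rank:
            kept.append(word)
    return kept


candidates = ["good", "beneficial", "advantageous", "nice", "unknown"]
print(filter_by_proficiency("helpful", candidates, TOY_LEXICON))
```

In this toy setting, contextual appropriateness is assumed to have been judged already; the filter only enforces the proficiency constraint.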
Abstract
Lexical substitution discovers appropriate substitutes for a given target word in a context sentence. However, the task fails to consider substitutes that are of equal or higher proficiency than the target, an aspect that could be beneficial for language learners looking to improve their writing. To bridge this gap, we propose a new task, language proficiency-oriented lexical substitution. We also introduce ProLex, a novel benchmark designed to assess systems' ability to generate not only appropriate substitutes but also substitutes that demonstrate better language proficiency. Besides the benchmark, we propose models that can automatically perform the new task. We show that our best model, a Llama2-13B model fine-tuned with task-specific synthetic data, outperforms ChatGPT by an average of 3.2% in F-score and achieves results comparable to GPT-4 on ProLex.
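The abstract reports results in F-score without giving the exact scoring procedure here. As a generic illustration only, the sketch below computes a set-based F-score between a system's predicted substitutes and a gold substitute set. The function name `f_score` and the example words are hypothetical, and the benchmark's actual metric may differ in its details.

```python
def f_score(predicted, gold, beta=1.0):
    """Set-based F-beta score between predicted and gold substitutes.

    Assumption: duplicates are ignored and matching is exact string equality.
    Returns 0.0 when either set is empty or when there is no overlap.
    """
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0

    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical example: one of two predictions matches one of three gold items.
predicted = ["beneficial", "pleasant"]
gold = ["beneficial", "advantageous", "favorable"]
print(round(f_score(predicted, gold), 3))
```

With precision 1/2 and recall 1/3, the harmonic mean for this example works out to 0.4.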
