Leveraging Open-Source Large Language Models for Native Language Identification

Yee Man Ng, Ilia Markov

TL;DR

The paper tackles Native Language Identification (NLI) by comparing open-source LLMs to closed-source models on TOEFL11 and ICLE-NLI, evaluating both out-of-the-box and fine-tuned configurations. It demonstrates that open-source LLMs lag behind closed-source models in zero-shot use but can achieve comparable performance when task-specific fine-tuning (e.g., with QLoRA and 4-bit quantization) is applied, with Gemma and LLaMA-3 among the standout open-source results. The study highlights the practical benefits of open-source LLMs, including transparency and the ability to fine-tune, while also emphasizing issues such as data leakage risk and cross-corpus generalization. Overall, the work provides evidence that fine-tuned open-source LLMs can approach or match proprietary LLMs on benchmark NLI tasks, suggesting a viable and more auditable path for NLI research and applications.

Abstract

Native Language Identification (NLI), the task of identifying the native language (L1) of a person based on their writing in a second language (L2), has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that rely heavily on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have notable disadvantages, such as high costs and the undisclosed nature of their training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.
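
To make the zero-shot setting concrete, the following is a minimal sketch of how a closed-set NLI query might be posed to a generative LLM and scored. The prompt wording, the function names (`build_prompt`, `parse_label`, `accuracy`), and the `"other"` fallback are illustrative assumptions, not the authors' setup; only the 11 L1 classes of TOEFL11 are taken from the benchmark itself. The call to an actual model is left out on purpose.

```python
import re

# The 11 native-language classes of the TOEFL11 corpus.
TOEFL11_L1S = [
    "Arabic", "Chinese", "French", "German", "Hindi", "Italian",
    "Japanese", "Korean", "Spanish", "Telugu", "Turkish",
]


def build_prompt(essay, labels=TOEFL11_L1S):
    """Build a closed-set zero-shot prompt (hypothetical wording)."""
    options = ", ".join(labels)
    return (
        "Read the English essay below and identify the author's native "
        f"language. Answer with exactly one of: {options}.\n\n"
        f"Essay:\n{essay}\n\nNative language:"
    )


def parse_label(generation, labels=TOEFL11_L1S, fallback="other"):
    """Map free-form model output to a class label.

    The label mentioned earliest in the output wins; if no known label
    appears, the (assumed) fallback class is returned.
    """
    hits = []
    for label in labels:
        match = re.search(rf"\b{re.escape(label)}\b", generation, re.IGNORECASE)
        if match:
            hits.append((match.start(), label))
    return min(hits)[1] if hits else fallback


def accuracy(gold, predicted):
    """Fraction of essays whose predicted L1 equals the gold L1."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

In use, each essay would be passed through `build_prompt`, the resulting string sent to whichever model is being evaluated, and the raw generation passed to `parse_label` before computing `accuracy`. Parsing the output this way is one simple design choice; constraining the model's output format would be an alternative.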

Paper Structure (24 sections, 1 figure, 1 table)

Figures (1)

  • Figure 1: Confusion matrices for GPT-4 on TOEFL (Zhang and Salle, 2023) (top left); Gemma (7B) (fine-tuned) on TOEFL (top right); GPT-4 on ICLE-NLI (bottom left); Gemma (7B) (fine-tuned) on ICLE-NLI (bottom right).