Better Benchmarking LLMs for Zero-Shot Dependency Parsing

Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares

TL;DR

The paper benchmarks zero-shot dependency parsing for open-weight LLMs against uninformed baselines, formalizing the task with a dependency graph $G=(W,A)$ over a sentence $W$ and analyzing outputs in CoNLL format. It introduces both conventional and novel uninformed baselines, including uniformly random projective trees and optimal-distance projective arrangements, to establish robust benchmarks. Across English, French, German, and Hindi, only the largest LLaMA 3 70B models show consistent, modest improvements over baselines, while most models perform near baselines, casting doubt on open-weight LLMs as zero-shot parsers. The study highlights the need for substantial scaling or alternative prompting/postprocessing strategies to make zero-shot parsing practically viable with open models.

Abstract

While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines, with only the newest and largest versions of LLaMA doing so for most languages, and still achieving rather low performance. Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.
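To make the idea of a baseline "that does not have access to the input sentence" concrete, here is a minimal sketch. It is an illustration, not the paper's implementation. It assumes a tree is stored as a list of head indices (word positions start at 1, and 0 marks the artificial root), and it scores predictions with unlabeled attachment accuracy, which is an assumption about the metric. The function names and the example sentence are hypothetical.

```python
from typing import List


def chain_to_previous(n: int) -> List[int]:
    """Uninformed baseline: word i takes word i-1 as head; word 1 attaches to the root (0)."""
    return [i - 1 for i in range(1, n + 1)]


def chain_to_next(n: int) -> List[int]:
    """Uninformed baseline: word i takes word i+1 as head; the last word attaches to the root (0)."""
    return [i + 1 if i < n else 0 for i in range(1, n + 1)]


def unlabeled_attachment_score(pred: List[int], gold: List[int]) -> float:
    """Fraction of words whose predicted head index equals the gold head index."""
    if len(pred) != len(gold):
        raise ValueError("predicted and gold trees must cover the same number of words")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


# Hypothetical gold heads for the 4-word sentence "She reads old books":
# "reads" is the root, "She" and "books" depend on "reads", "old" depends on "books".
gold_heads = [2, 0, 4, 2]

# Both baselines only look at the sentence length, never at the words themselves.
score_previous = unlabeled_attachment_score(chain_to_previous(len(gold_heads)), gold_heads)
score_next = unlabeled_attachment_score(chain_to_next(len(gold_heads)), gold_heads)
```

Both baselines only need the sentence length, so any score they reach reflects regularities of the treebank rather than any understanding of the sentence. This is why they are a useful lower bound for a zero-shot parser.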

Paper Structure

This paper contains 29 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Prompt and output after the first post-processing step. See Figure 3 for the step-by-step process.
  • Figure 2: F-score across displacements in the EnglishEWT test set.
  • Figure 3: Dependency parsing prompt and the resulting tree after the second post-processing step. Figure 1 showed the original tree.
  • Figure 4: F-score across displacements in the FrenchGSD test set.
  • Figure 5: F-score across displacements in the GermanGSD test set.
  • ...and 1 more figure