Better Benchmarking LLMs for Zero-Shot Dependency Parsing

Ana Ezquerro; Carlos Gómez-Rodríguez; David Vilares

TL;DR

The paper benchmarks zero-shot dependency parsing for open-weight LLMs against uninformed baselines, formalizing the task with a dependency graph $G=(W,A)$ over a sentence $W$ and analyzing outputs in CoNLL format. It introduces both conventional and novel uninformed baselines, including uniformly random projective trees and optimal-distance projective arrangements, to establish robust benchmarks. Across English, French, German, and Hindi, only the largest LLaMA 3 70B models show consistent, modest improvements over baselines, while most models perform near baselines, casting doubt on open-weight LLMs as zero-shot parsers. The study highlights the need for substantial scaling or alternative prompting/postprocessing strategies to make zero-shot parsing practically viable with open models.

Abstract

While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines, with only the newest and largest versions of LLaMA doing so for most languages, and still achieving rather low performance. Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.
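
To make the idea of a baseline "without access to the input sentence" concrete, here is a minimal sketch in Python. It assumes a left-branching baseline (each word attaches to the previous word, and the first word is the root) and scores it with unlabeled attachment score (UAS). The choice of baseline, the function names, and the toy gold tree are illustrative assumptions, not the paper's implementation; the random projective tree and optimal linear arrangement baselines mentioned above are not reproduced here.

```python
from typing import List


def left_branching_heads(n: int) -> List[int]:
    """Uninformed baseline (assumed for illustration): it sees only the length n.

    Words are numbered 1..n and 0 denotes the artificial root.
    Word 1 attaches to the root; every other word i attaches to word i - 1.
    """
    return [i - 1 for i in range(1, n + 1)]


def is_projective(heads: List[int]) -> bool:
    """Return True if no two arcs cross, counting the root as position 0."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if a < c < b < d:
                return False
    return True


def uas(pred: List[int], gold: List[int]) -> float:
    """Unlabeled attachment score: fraction of words with the correct head."""
    if not gold or len(pred) != len(gold):
        raise ValueError("pred and gold must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)


# Toy gold tree for "The dog barked loudly" (hypothetical annotation):
# The -> dog, dog -> barked, barked -> root, loudly -> barked.
gold_heads = [2, 3, 0, 3]
pred_heads = left_branching_heads(len(gold_heads))
score = uas(pred_heads, gold_heads)
```

On this toy sentence the baseline gets only the last word's head right, so `score` is 0.25. An LLM parser is informative only to the extent that it clears scores of this kind on real treebank data.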
