ChatGPT may Pass the Bar Exam soon, but has a Long Way to Go for the LexGLUE benchmark

Ilias Chalkidis

TL;DR

The paper evaluates GPT-3.5-turbo on the LexGLUE legal text classification benchmark using templated instruction-following prompts in a zero-shot setup, with additional few-shot analyses. It employs exact-match evaluation supplemented by embedding-based similarity across seven tasks, including ECtHR and LEDGAR, to quantify performance and limitations. The results show an average micro-F1 of approximately $47.6\%$ in the zero-shot setting, with notably higher scores of $62.8\%$ on ECtHR B and $70.2\%$ on LEDGAR. Overall, however, smaller fine-tuned models outperform GPT-3.5-turbo, and few-shot prompts help only under small label-set conditions ($K \approx L$). The study concludes that while ChatGPT possesses non-trivial legal knowledge, production-ready legal classification remains better served by task-specific fine-tuned models or domain-adapted prompts, which points future work toward stronger, domain-focused LLMs and prompting strategies.
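As a rough illustration of the metric mentioned above, the sketch below maps a free-text model answer onto a label set by exact string match and then computes micro-F1 over multi-label predictions. The function names `parse_labels` and `micro_f1`, the comma-separated answer format, and the toy labels are assumptions made for illustration; this is not the paper's evaluation code.

```python
from __future__ import annotations


def parse_labels(answer: str, label_names: list[str]) -> set[str]:
    """Keep only the answer pieces that exactly match a known label name.

    Assumption: the model answers with a comma-separated list of labels.
    Matching is case-insensitive and ignores a trailing period.
    """
    lookup = {name.lower(): name for name in label_names}
    found: set[str] = set()
    for piece in answer.split(","):
        key = piece.strip().strip(".").lower()
        if key in lookup:
            found.add(lookup[key])
    return found


def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    over all examples and labels before computing the score."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    labels = ["Article 3", "Article 6", "Article 8"]  # toy label set
    gold = [{"Article 3", "Article 6"}, {"Article 8"}]
    answers = ["Article 3", "Article 8, Article 6"]    # hypothetical model outputs
    pred = [parse_labels(a, labels) for a in answers]
    print(f"micro-F1 = {micro_f1(gold, pred):.3f}")
```

Pooling counts before computing F1 means frequent labels dominate the score, which is why micro-F1 is usually read alongside a macro-averaged figure.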

Abstract

Following the hype around OpenAI's ChatGPT conversational agent, the latest step in the recent development of Large Language Models (LLMs) that demonstrate emergent, unprecedented zero-shot capabilities, we audit OpenAI's latest GPT-3.5 model, 'gpt-3.5-turbo', the first available ChatGPT model, on the LexGLUE benchmark in a zero-shot fashion, providing examples in a templated instruction-following format. The results indicate that ChatGPT achieves an average micro-F1 score of 47.6% across LexGLUE tasks, surpassing the baseline guessing rates. Notably, the model performs exceptionally well on some datasets, achieving micro-F1 scores of 62.8% and 70.2% on the ECtHR B and LEDGAR datasets, respectively. The code base and model predictions are available for review at https://github.com/coastalcph/zeroshot_lexglue.
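To make "templated instruction-following format" concrete, here is a minimal sketch of how a zero-shot classification prompt could be assembled from a document, a question, and a list of label options. The template wording, the function name `build_zero_shot_prompt`, and the sample inputs are hypothetical; the templates actually used are in the repository linked above.

```python
from __future__ import annotations


def build_zero_shot_prompt(text: str, question: str, options: list[str]) -> str:
    """Fill a fixed template with the input text and the candidate labels.

    Assumption: one option per line, followed by a cue that asks the model
    to answer with the relevant option names only.
    """
    option_lines = "\n".join(f"- {option}" for option in options)
    return (
        "Given the following text:\n"
        f'"{text}"\n'
        f"{question}\n"
        f"{option_lines}\n"
        "The relevant options are:"
    )


if __name__ == "__main__":
    prompt = build_zero_shot_prompt(
        text="The applicant complained about the length of the proceedings.",
        question="Which of the following options apply, if any?",
        options=["Article 3", "Article 6", "None"],
    )
    print(prompt)
```

Listing the options explicitly keeps the model's answer within the label vocabulary, which is what makes exact-match scoring of its output feasible.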

Paper Structure

This paper contains 7 sections, 1 figure, and 3 tables.

Introduction
Experiments
LexGLUE Datasets
Experimental Setup
Results & Discussion
Limitations
Conclusions

Figures (1)

Figure 1: Averaged performance on LexGLUE.
