Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation

Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton

TL;DR

The study asks whether autoregressive LLMs encode the linguistic knowledge needed to assess thematic fit for event participants, and how best to elicit it. It conducts a prompting study along three axes (Reasoning form, Input form, and Output form), comparing simple versus step-by-step prompts, lemma-tuple versus generated-sentence inputs, and numeric versus categorical outputs. Closed models (GPT-4 variants) achieve state-of-the-art results on four thematic-fit benchmarks, with Step-by-Step prompting often helping, while open models lag and respond differently to input and output configurations. A key finding is that generated sentences can hurt closed models but let open models exploit input context through semantic filtering, which shows that elicitation strategies interact strongly with model family and prompt design. The work suggests that better sentence generation and filtering, more diverse data, and cross-linguistic evaluation are needed for thematic-fit capabilities to generalize across LLMs and tasks.
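The three axes can be read as a small grid of prompting conditions. The sketch below is only an illustration of that reading: the level names paraphrase the summary above, the function name is hypothetical, and it assumes the axes are fully crossed, which the summary does not state. It is not taken from the authors' released code.

```python
from itertools import product

# Illustrative labels paraphrasing the three axes; not the authors' identifiers.
REASONING_FORMS = ("simple", "step_by_step")
INPUT_FORMS = ("lemma_tuple", "generated_sentence")
OUTPUT_FORMS = ("numeric", "categorical")


def prompting_conditions():
    """Return every (reasoning, input, output) combination as a dict.

    Assumption: the axes are fully crossed; the paper may run only a subset.
    """
    return [
        {"reasoning": r, "input": i, "output": o}
        for r, i, o in product(REASONING_FORMS, INPUT_FORMS, OUTPUT_FORMS)
    ]


conditions = prompting_conditions()
for condition in conditions:
    print(condition)
```

Under the full-crossing assumption this yields eight conditions per model; the experiment numbering used in the paper itself (for example Exp. 3.x and Exp. 4.x) is not reproduced here.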

Abstract

We show closed models possess much thematic fit knowledge and set a new state of the art, while open models also seem to capture much relevant knowledge (in semantic filtering), but yield lower scores. Surprisingly, multi-step reasoning only helped closed models (with few exceptions); generated sentences hurt closed models' performance; and output form had little to no effect. We analyze the reasons for these findings, and conclude that more foundational work is needed for a single LLM to perform the best on all tasks with the same experimental condition, let alone improve results further. Source code is available at: https://github.com/SafeyahShemali/LLM_Thematic_Fit_25

Paper Structure

This paper contains 41 sections, 2 figures, 12 tables.

Figures (2)

  • Figure 1: Experiment Method. For more details on Step-by-Step Prompting and Semantic Filtering, see the section on reasoning. The output of Exp. 3.x through Exp. 4.x contains a justification for the score in addition to the score itself.
  • Figure 2: Effect of Early Bad Reasoning. The example was taken from preliminary experimentation.