Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton
TL;DR
The study asks whether autoregressive LLMs encode the linguistic knowledge needed to assess thematic fit for event participants, and how best to elicit it. It runs a prompting study along three axes (Reasoning form, Input form, and Output form), comparing simple versus step-by-step prompts, lemma-tuple versus generated-sentence inputs, and numeric versus categorical outputs. Closed models (GPT-4 variants) achieve state-of-the-art results on four thematic-fit benchmarks, with Step-by-Step prompting often helping, while open models lag and respond differently to input and output configurations. A key finding is that generated sentences can hurt closed models, whereas open models can exploit the added sentence context once the sentences are semantically filtered; elicitation strategy thus interacts strongly with model family and prompt design. The work suggests that improved sentence generation and filtering, more diverse data, and cross-linguistic evaluation are needed to generalize thematic-fit capabilities across LLMs and tasks.
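The three axes above form a 2 x 2 x 2 grid of experimental conditions. The following is a minimal sketch of how such a grid could be enumerated and turned into prompts. It is not the authors' implementation: the function names, axis labels, prompt wording, and answer scales are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Axis values follow the text; the string labels themselves are illustrative.
REASONING = ("simple", "step_by_step")
INPUT = ("lemma_tuple", "generated_sentence")
OUTPUT = ("numeric", "categorical")


def conditions():
    """Enumerate every combination of the three axes (2 * 2 * 2 = 8)."""
    return [
        {"reasoning": r, "input": i, "output": o}
        for r, i, o in product(REASONING, INPUT, OUTPUT)
    ]


def build_prompt(cond, predicate, argument, role, sentence=None):
    """Assemble a prompt for one condition. All wording here is hypothetical."""
    if cond["input"] == "generated_sentence":
        if sentence is None:
            raise ValueError("generated_sentence input needs a sentence")
        context = f'Sentence: "{sentence}"'
    else:
        context = f"Predicate: {predicate}; argument: {argument}; role: {role}"

    question = f"How well does '{argument}' fit the {role} role of '{predicate}'?"

    # Assumed answer scales; the source does not specify them here.
    if cond["output"] == "numeric":
        answer = "Answer with a number from 0 to 1."
    else:
        answer = "Answer with one of: low, medium, high."

    parts = [context, question]
    if cond["reasoning"] == "step_by_step":
        parts.append("Reason step by step before answering.")
    parts.append(answer)
    return "\n".join(parts)


if __name__ == "__main__":
    for cond in conditions():
        print(cond)
    print(build_prompt(conditions()[0], "eat", "apple", "patient"))
```

Keeping the condition as a plain dictionary makes it straightforward to log each model response together with the exact configuration that produced it.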
Abstract
We show that closed models possess much thematic fit knowledge and set a new state of the art, while open models also appear to capture much relevant knowledge (as seen in semantic filtering) but yield lower scores. Surprisingly, multi-step reasoning helped only closed models (with few exceptions); generated sentences hurt closed models' performance; and output form had little to no effect. We analyze the reasons for these findings and conclude that more foundational work is needed before a single LLM can perform best on all tasks under the same experimental condition, let alone improve results further. Source code is available at: https://github.com/SafeyahShemali/LLM_Thematic_Fit_25
