LexTime: A Benchmark for Temporal Ordering of Legal Events
Claire Barale, Leslie Barrett, Vikram Sunil Bajaj, Michael Rovatsos
TL;DR
LexTime introduces a dedicated benchmark for temporal event ordering in legal language, built from 512 U.S. federal complaint contexts with explicit and implicit events. The study shows that LLMs perform better on legal text than on narrative benchmarks, and that longer contexts and implicit-explicit event pairs boost accuracy, with a top score of 80.8% for GPT-4 Turbo. Chain-of-thought (CoT) prompting generally does not help on this task, highlighting the need for domain-specific prompting and modeling strategies that handle legal syntax, paraphrasing, and subordinate clauses. Through dataset construction, linguistic analysis, and benchmarking, the work identifies concrete directions for improving legal temporal reasoning in large language models and provides a resource for evaluating such progress.
Abstract
Understanding temporal relationships and accurately reconstructing the event timeline are important for case law analysis, compliance monitoring, and legal summarization. However, existing benchmarks do not evaluate specialized language, leaving a gap in understanding how LLMs handle event ordering in legal contexts. We introduce LexTime, a dataset designed to evaluate LLMs' event ordering capabilities in legal language, consisting of 512 instances from U.S. federal complaints with annotated event pairs and their temporal relations. Our findings show that (1) LLMs are more accurate on legal event ordering than on narrative texts (up to +10.5%); (2) longer input contexts and implicit events boost accuracy, reaching 80.8% for implicit-explicit event pairs; (3) legal linguistic complexities and nested clauses remain a challenge. While performance is promising, specific features of legal texts remain a bottleneck for legal temporal event reasoning, and we propose concrete modeling directions to better address them.
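
To make the task setup concrete, the following is a minimal Python sketch of how an annotated event pair and the accuracy metric could be represented. Everything in it is an assumption made for illustration: the `EventPair` class, its field names, the BEFORE/AFTER label set, and the two example sentences are hypothetical and do not reflect the actual LexTime schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventPair:
    """Hypothetical record for one instance (not the real LexTime schema)."""
    context: str   # passage from a complaint containing both events
    event_a: str   # first event mention
    event_b: str   # second event mention
    relation: str  # gold label: does event_a occur "BEFORE" or "AFTER" event_b?


def accuracy(gold, predicted):
    """Return the fraction of pairs whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g.relation == p)
    return correct / len(gold)


# Invented toy examples, written only for this sketch.
gold = [
    EventPair(
        context="After the defendant signed the agreement, the plaintiff wired the funds.",
        event_a="signed the agreement",
        event_b="wired the funds",
        relation="BEFORE",
    ),
    EventPair(
        context="The plaintiff filed suit, having previously sent a demand letter.",
        event_a="filed suit",
        event_b="sent a demand letter",
        relation="AFTER",
    ),
]

# Stand-in for model outputs; a real evaluation would query an LLM here.
predictions = ["BEFORE", "BEFORE"]

print(accuracy(gold, predictions))  # 0.5
```

The second toy example mirrors the kind of subordinate-clause construction the abstract identifies as difficult: the textual order of the two events is the reverse of their temporal order.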
