Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii
TL;DR
The paper addresses QA reliability by treating question temporality as a first-class factor. It introduces EverGreenQA, a multilingual dataset of 4,757 questions across 7 languages with evergreen labels, and EG-E5, a lightweight multilingual classifier that achieves state-of-the-art performance on evergreen classification. The authors benchmark 12 LLMs on explicit verbalized evergreen judgments and on implicit uncertainty signals, and find that explicit judgments outperform uncertainty, especially across languages. Three applications demonstrate practical utility: evergreen probability significantly boosts self-knowledge calibration; filtering QA datasets shows that a nontrivial portion of existing QA data is mutable and can therefore bias evaluation; and GPT-4o's retrieval decisions align closely with question temporality. The dataset and classifier are released to support trustworthy, time-aware QA and scalable multilingual dataset curation.
Abstract
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions: whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.
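As a rough illustration of the dataset-filtering application, the sketch below keeps only questions whose evergreen probability clears a threshold. It is not the authors' pipeline: `QAItem`, `toy_evergreen_probability`, `filter_evergreen`, the keyword cues, and the 0.5 threshold are all hypothetical, and the toy scorer merely stands in for a trained classifier such as EG-E5.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class QAItem:
    question: str
    answer: str


# Hypothetical cue words; a real system would use a trained classifier instead.
MUTABLE_CUES = {"current", "currently", "latest", "now", "today"}


def toy_evergreen_probability(question: str) -> float:
    """Toy stand-in scorer: low probability if a time-sensitive cue word appears."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    return 0.1 if tokens & MUTABLE_CUES else 0.9


def filter_evergreen(
    items: Iterable[QAItem],
    score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[QAItem]:
    """Keep items whose question scores at or above the threshold."""
    return [item for item in items if score(item.question) >= threshold]


items = [
    QAItem("What is the chemical symbol for gold?", "Au"),
    QAItem("Who is the current president of France?", "(changes over time)"),
]
kept = filter_evergreen(items, toy_evergreen_probability)
print([item.question for item in kept])
```

Passing the scorer as a callable keeps the filter independent of any particular model, so a real classifier's probability function could be swapped in without changing the filtering logic.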
