HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits

Tim Franzmeyer; Aleksandar Shtedritski; Samuel Albanie; Philip Torr; João F. Henriques; Jakob N. Foerster

TL;DR

HelloFresh introduces a living benchmark for evaluating LLMs using continuous streams of real-world human editorial actions from X community notes and Wikipedia edits, addressing test-data contamination and benchmark overfitting. It defines zero-shot and web-search evaluation regimes, demonstrates temporal consistency in model rankings, and provides a public leaderboard with quarterly data releases to enable ongoing, ground-truth evaluation. The approach emphasizes grounding LLMs with external sources and analyzes prompt sensitivity, recall thresholds, and voter-count effects, offering practical insights for robust, up-to-date evaluation. The work points to extensions into multi-modality, justification generation, and deeper analysis of community dynamics to further improve real-world LLM reliability and safety.

Abstract

Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real-world tasks is vital for safe development. Designing adequate LLM benchmarks is challenging: data from real-world tasks is hard to collect, public availability of static evaluation data results in test data contamination and benchmark overfitting, and periodically generating new evaluation data is tedious and may result in temporally inconsistent results. We introduce HelloFresh, based on continuous streams of real-world data generated by intrinsically motivated human labelers. It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages, mitigating the risk of test data contamination and benchmark overfitting. Any X user can propose an X note to add context to a misleading post (formerly tweet); if the community classifies it as helpful, it is shown with the post. Similarly, Wikipedia relies on community-based consensus, allowing users to edit articles or revert edits made by other users. Verifying whether an X note is helpful or whether a Wikipedia edit should be accepted are hard tasks that require grounding by querying the web. We backtest state-of-the-art LLMs supplemented with simple web search access and find that HelloFresh yields a temporally consistent ranking. To enable continuous evaluation on HelloFresh, we host a public leaderboard and periodically updated evaluation data at https://tinyurl.com/hello-fresh-LLM.
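The backtest described in the abstract can be pictured with a minimal sketch: score each model per time period, rank the models within each period, and check whether the ranking stays the same. This is an illustration only, not the authors' evaluation code. The record layout `(period, model, label, prediction)`, the function names, and the strict "identical ordering in every period" criterion are all assumptions made for this example.

```python
from collections import defaultdict


def f1_score(labels, preds):
    """Binary F1 for the positive class (True = helpful note / accepted edit)."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rankings_by_period(records):
    """Rank models by F1 within each period.

    records: iterable of (period, model, label, prediction) tuples
    (a hypothetical layout chosen for this sketch).
    """
    grouped = defaultdict(lambda: ([], []))
    for period, model, label, pred in records:
        labels, preds = grouped[(period, model)]
        labels.append(label)
        preds.append(pred)

    scores = defaultdict(dict)
    for (period, model), (labels, preds) in grouped.items():
        scores[period][model] = f1_score(labels, preds)

    # Best model first within each period.
    return {
        period: sorted(by_model, key=by_model.get, reverse=True)
        for period, by_model in scores.items()
    }


def is_temporally_consistent(rankings):
    """True if every period yields the same model ordering (assumed criterion)."""
    orders = list(rankings.values())
    return all(order == orders[0] for order in orders)
```

In practice a rank-correlation measure would be a softer alternative to exact equality of orderings; the strict check is used here only to keep the sketch short.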

Paper Structure (44 sections, 1 equation, 10 figures)

Introduction
Related Work
Background
X Community Notes
Editing of Wikipedia Articles
Benchmark Setup and Evaluation
Dataset Creation
X Community Notes
Wikipedia Edits
Task Formulation
Implementation of Evaluation Regimes
Prompt Selection
Zero-shot Classification Regime
Web-search Regime
Classification of Model Outputs
...and 29 more sections

Figures (10)

Figure 1: Examples of X community notes. The left note was classified as helpful by the voters on X, while the right note was not.
Figure 2: Examples of two Wikipedia edits. The left edit is classified as 'accepted' by our algorithm, while the right edit is classified as 'rejected' as it was later reverted. In both cases, the article remained unchanged for a certain number of edits after the edit or reversion of the edit, respectively. Figure 3 shows a scheme of the algorithm used to filter for 'accepted' and 'rejected' edits.
Figure 3: Classification of Wikipedia edits.
Figure 4: We observe that the ranking of different models by zero-shot classifier F1 score is largely temporally consistent. GPT4 consistently ranks first on X notes (left plot), while GPT3.5 consistently ranks first on Wikipedia edits. Note that GPT4 achieved an F1 score of less than 20% on Wikipedia edits in the zero-shot regime; its score is much higher for the web-search agent (see the temporal-consistency figure for the web-search agent).
Figure 5: We plot the difference in F1 score between the zero-shot classifier and the web-search agent for the manually-written prompt and the different rephrasings of it. We first observe that outputs are generally highly sensitive to the wording of the prompt, i.e. results for different phrasings of the manually-written prompt are very different. We further observe that in almost all cases, the F1 score is higher for the web-search agent than for the zero-shot classifier for X notes (left side). Interestingly, this is very different from Wikipedia edits (right side), where we observe large positive and negative changes in F1 scores.
...and 5 more figures
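The accepted/rejected labeling described in the Figure 2 caption can be illustrated with a small sketch. The paper's actual algorithm is the one depicted in Figure 3 and is not reproduced on this page, so everything below is an assumption: the history is modeled as a list of text snapshots of the edited passage, `stable_window` stands in for the unspecified "certain number of edits", and the function name `classify_edit` is hypothetical.

```python
def classify_edit(snapshots, edit_index, stable_window=3):
    """Label one edit as 'accepted', 'rejected', or None (filtered out).

    snapshots: passage text after each revision, oldest first (assumed layout).
    edit_index: position of the revision that introduced the edit (>= 1).
    stable_window: hypothetical number of later revisions that must leave
        the passage untouched before a label is assigned.
    """
    if edit_index < 1 or edit_index >= len(snapshots):
        raise ValueError("edit_index must point at a revision with a predecessor")

    before = snapshots[edit_index - 1]
    after = snapshots[edit_index]
    later = snapshots[edit_index + 1:]

    # Accepted: the edited text survives the next `stable_window` revisions.
    if len(later) >= stable_window and all(s == after for s in later[:stable_window]):
        return "accepted"

    # Rejected: the passage is reverted to its old text and then stays there.
    for offset, snap in enumerate(later):
        if snap == before:
            tail = later[offset + 1: offset + 1 + stable_window]
            if len(tail) == stable_window and all(s == before for s in tail):
                return "rejected"
            return None  # reverted, but not stable for long enough
        if snap != after:
            return None  # changed to something else: ambiguous, so drop it
    return None  # not enough history yet to decide
```

For example, `classify_edit(["A", "B", "B", "B", "B"], 1)` yields `"accepted"`, while a history that returns to `"A"` and stays there yields `"rejected"`. Edits that match neither pattern are dropped rather than labeled, mirroring the filtering role the caption attributes to the algorithm.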
