Assessing LLM Reliability on Temporally Recent Open-Domain Questions
Pushwitha Krishnappa, Amit Das, Vinija Jain, Tathagata Mukherjee, Aman Chadha
TL;DR
This work analyzes LLM reliability on temporally recent open-domain questions by constructing RECOM, a benchmark of 15,000 Reddit questions with community-derived reference answers. It evaluates four open-source LLMs using a multi-dimensional framework spanning lexical, semantic, and logical-inference metrics, revealing a semantic-lexical paradox: models achieve over 99% cosine similarity with references while BLEU-1 overlap remains under 8%. The study shows that model scale does not predict performance, with a 7B model (Mistral-7B) outperforming a 20B model (GPT-OSS-20B) across metrics, and positions MoverScore as a middle-ground metric that reflects semantic transport cost. These findings challenge reliance on lexical similarity for evaluating abstractive generation and support multi-faceted evaluation that captures semantic fidelity and alignment with community perspectives in dynamic information contexts.
Abstract
Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 Reddit questions posted in September 2025, each paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a gap of more than 90 percentage points indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B outperforms the larger GPT-OSS-20B across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting that models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at https://anonymous.4open.science/r/recom-D4B0
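The semantic-lexical gap described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: it assumes whitespace tokenization, a single reference per question, and embedding vectors supplied by some sentence encoder that the abstract does not name. The function names `bleu1`, `cosine`, and `gap_points` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision times the BLEU brevity penalty (single reference)."""
    cand = candidate.lower().split()  # assumption: plain whitespace tokenization
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    ref_counts = Counter(ref)
    # Credit each candidate token at most as often as it occurs in the reference.
    matches = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = matches / len(cand)
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return precision * brevity


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gap_points(cosine_score: float, bleu_score: float) -> float:
    """Semantic-lexical gap in percentage points, for scores given in [0, 1]."""
    return 100.0 * (cosine_score - bleu_score)


if __name__ == "__main__":
    # Toy sentences invented for illustration; they are not drawn from RECOM.
    reference = "restart the router and check the cable"
    paraphrase = "try rebooting your modem then inspect its wiring"
    print("BLEU-1 of paraphrase:", bleu1(paraphrase, reference))
    # Bounds quoted in the abstract: cosine above 0.99, BLEU-1 below 0.08.
    print("Gap at those bounds (points):", gap_points(0.99, 0.08))
```

A paraphrase that shares no tokens with its reference scores zero on BLEU-1 regardless of how close its meaning is, which is why an embedding-based cosine score can diverge from it so sharply.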
