Assessing LLM Reliability on Temporally Recent Open-Domain Questions

Pushwitha Krishnappa; Amit Das; Vinija Jain; Tathagata Mukherjee; Aman Chadha

TL;DR

This work analyzes LLM reliability on temporally recent open-domain questions by constructing RECOM, a benchmark of 15,000 Reddit questions with community-derived reference answers. It evaluates four open-source LLMs using a multi-dimensional framework spanning lexical, semantic, and logical-inference metrics, and reveals a semantic-lexical paradox: models achieve over 99% cosine similarity with the references while BLEU-1 overlap remains under 8%. The study shows that model scale does not predict performance, with a 7B model (Mistral-7B) outperforming a 20B model (GPT-OSS-20B) across metrics, and it positions MoverScore as a middle-ground metric that reflects the transport cost of semantic alignment. These findings challenge reliance on lexical similarity for evaluating abstractive generation and support multi-faceted evaluation that captures semantic fidelity and alignment with community perspectives in dynamic information contexts.

Abstract

Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at https://anonymous.4open.science/r/recom-D4B0
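To illustrate why lexical overlap can be low for a faithful paraphrase, the following is a minimal sketch of BLEU-1 as clipped unigram precision with a brevity penalty. It is not the authors' evaluation code: the function name `bleu1`, the lowercase whitespace tokenization, and the example sentences are assumptions made for illustration only.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision times the BLEU brevity penalty.

    Tokenization is plain lowercase whitespace splitting (an assumption;
    real evaluations typically use a proper tokenizer).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    # Each candidate token is credited at most as often as it occurs
    # in the reference ("clipping").
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) >= len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return bp * precision


# Hypothetical example pair: same meaning, almost no shared words.
reference = "the new update drains the battery very quickly"
paraphrase = "battery life gets much worse after installing it"
score = bleu1(paraphrase, reference)  # only "battery" is shared
```

In this toy pair only one of eight candidate tokens appears in the reference, so the score is low even though the two sentences say the same thing. An embedding-based metric would be expected to rate such a pair much higher, which is the kind of gap the abstract describes.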

Paper Structure (41 sections, 7 figures, 12 tables)

Introduction
Related Work
Methodology
Dataset
Data Collection
Engagement-Based Filtering and Sampling
Construction of Reference Answer
Evaluated Models
Response Generation
Response Filtering
Evaluation Framework
Lexical Overlap Metrics
Semantic Similarity Metrics
Logical Inference Metrics
Statistical Analysis
...and 26 more sections

Figures (7)

Figure 1: The LLM Alignment Evaluation for RECOM. 1. Extract Reddit questions and summarize human responses to construct reference answers. 2. Four LLMs generate responses to the same questions. 3. Evaluate outputs using lexical, semantic, and NLI metrics.
Figure 2: BLEU score comparison across models. BLEU-1 ranges from 0.57% (Gemma2-9B) to 7.58% (Llama-3.1-8B), a 13× difference, while BLEU-4 scores remain below 1% for all models. Llama-3.1-8B's elevated scores may partially reflect self-alignment bias, as it was also used for reference summarization.
Figure 3: ROUGE F1 score comparison across models. Mistral-7B achieves highest scores across all variants (ROUGE-1: 19.97%, ROUGE-L: 13.16%), followed closely by Llama-3.1-8B. Gemma2-9B shows lowest overlap (ROUGE-1: 9.13%), consistent with its extreme abstraction profile. ROUGE-2 scores remain below 4% for all models, indicating minimal bigram overlap.
Figure 4: BERTScore comparison across models (RoBERTa-large embeddings). Despite the 13× variation in BLEU-1 scores (Figure 2), BERTScore F1 varies by only 1.54 percentage points (83.29%–84.83%). Precision exceeds recall for all models, indicating generated answers contain semantically relevant content but are more concise than reference summaries.
Figure 5: Cosine similarity comparison across models. All models exceed 99% similarity in RoBERTa-large embedding space, with only 0.41 percentage points separating Mistral-7B (99.51%) from GPT-OSS-20B (99.10%). This near-ceiling performance contrasts sharply with the <8% BLEU-1 scores; the 90+ percentage point gap constitutes our central finding.
...and 2 more figures
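
The ratios and gaps quoted in the captions above can be reproduced from the reported numbers alone. The short check below is a sketch, not part of the paper: the variable names are illustrative, and it uses only the values stated in the captions, treating them as the extremes across models as the captions describe.

```python
# Values copied from the figure captions (percentages).
bleu1_scores = {"Gemma2-9B": 0.57, "Llama-3.1-8B": 7.58}
cosine_scores = {"Mistral-7B": 99.51, "GPT-OSS-20B": 99.10}
bertscore_f1_min, bertscore_f1_max = 83.29, 84.83

# Ratio between the highest and lowest BLEU-1 (the "13x" in Figure 2).
bleu_ratio = bleu1_scores["Llama-3.1-8B"] / bleu1_scores["Gemma2-9B"]

# Spread of the semantic metrics, in percentage points.
bertscore_spread = bertscore_f1_max - bertscore_f1_min
cosine_spread = cosine_scores["Mistral-7B"] - cosine_scores["GPT-OSS-20B"]

# Smallest semantic-lexical gap: lowest cosine minus highest BLEU-1.
min_gap = min(cosine_scores.values()) - max(bleu1_scores.values())

print(f"BLEU-1 ratio:       {bleu_ratio:.1f}x")
print(f"BERTScore F1 range: {bertscore_spread:.2f} pp")
print(f"Cosine range:       {cosine_spread:.2f} pp")
print(f"Smallest gap:       {min_gap:.2f} pp")
```

Even in the least favorable pairing (lowest cosine similarity against highest BLEU-1), the gap stays above 90 percentage points, consistent with the "90+" figure in the Figure 5 caption.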
