Auditing the Reliability of Multimodal Generative Search

Erfan Samieyan Sahneh, Luca Maria Aiello

Abstract

Multimodal Large Language Models (MLLMs) increasingly function as generative search systems that retrieve and synthesize answers from multimedia content, including YouTube videos. Although these systems project authority by citing specific videos as evidence, the extent to which these citations genuinely substantiate the generated claims remains unexamined. We present a large-scale audit of the Gemini 2.5 Pro multimodal search system, analyzing 11,943 claim-video pairs generated across Medical, Economic, and General domains. Through automated verification using three independent LLM judges (87.7% inter-rater agreement), validated against human annotations, we find that depending on the judge's strictness, between 3.7% and 18.7% of video-grounded claims are not supported by their cited sources. The dominant failure modes are not outright contradictions but rather unverifiable specificities and overstated claims, suggesting the system injects precise but ungrounded details from parametric knowledge while citing videos as evidence. Exploratory post-hoc analysis via logistic regression reveals properties associated with these failures: claims departing from source vocabulary ($\beta = -1.6$ to $-3.1$, $p < 0.01$) and claims with low semantic similarity to the video transcript ($\beta = -2.1$ to $-11.6$, $p < 0.01$) are significantly more likely to be unsupported. These findings characterize the current trustworthiness of video-based generative search and highlight the gap between the confidence these systems project and the fidelity of their outputs.
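To make the verification bookkeeping behind these numbers concrete, the sketch below computes pairwise inter-rater agreement and per-judge unsupported-claim rates from binary verdicts. It is a minimal, hypothetical example: the judge names follow the paper, but the `verdicts` structure and toy labels are assumptions for illustration, not the audit's actual data or code.

```python
# Hypothetical sketch: pairwise agreement and unsupported-claim rates over
# binary judge verdicts. Toy data only, not the audit's real outputs.
from itertools import combinations

# verdicts[judge][i] is True if the judge marked claim i as supported.
verdicts = {
    "Gemini-3": [True, True, False, True, True, True],
    "Grok-4.1": [True, True, False, True, True, True],
    "gpt-5.2":  [True, False, False, True, True, False],
}

def agreement(a, b):
    """Fraction of claims on which two judges return the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Pairwise inter-rater agreement (the paper reports 87.7% on average).
for j1, j2 in combinations(verdicts, 2):
    print(f"{j1} vs {j2}: {agreement(verdicts[j1], verdicts[j2]):.1%}")

# Per-judge unsupported rate (the 3.7%-18.7% range quoted above).
for judge, v in verdicts.items():
    print(f"{judge}: {1 - sum(v) / len(v):.1%} of claims unsupported")
```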

Figures (7)

  • Figure 1: Overview of the auditing pipeline. Queries from three domains are submitted to a multimodal LLM (Gemini 2.5 Pro), which returns claims citing YouTube videos. Each claim-video pair is split into its factual claim and the video's independently extracted textual content: transcript (generated via Whisper ASR), title, description, and upload date. The combined claim and source text are then submitted to three independent LLM judges for verification, yielding per-claim verdicts that inform error-rate analysis and exploratory regression.
  • Figure 2: Dataset overview across three domains. (a) Claim length distributions are right-skewed but similar in shape across domains. (b) Video duration distributions. (c) Upload date distributions of cited videos.
  • Figure 3: Pairwise agreement rates between LLM judges across domains. Circle size and color encode agreement percentage. Gemini-3 and Grok-4.1 show high concordance (>96%), while pairs involving gpt-5.2 show lower agreement, reflecting its stricter evaluation criteria.
  • Figure 4: Unsupported Claims (%) by Domain and Judge. gpt-5.2 shows significantly higher error detection rates, reflecting its stricter threshold for OVERSTATED claims.
  • Figure 5: Significant logistic regression coefficients by domain. Each point shows the mean coefficient across three judges; error bars show the averaged 95% confidence intervals. Only features with at least one significant coefficient ($p < 0.1$) across judges are shown. Significance levels: $^{*}p < 0.1$, $^{**}p < 0.05$, $^{***}p < 0.01$. Transcript similarity and noun overlap are consistently the strongest protective predictors across all domains, while word count is the only consistent risk factor. An illustrative sketch of this regression appears after this list.
  • ...and 2 more figures
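To make the exploratory regression of Figure 5 concrete, the following is a minimal sketch of a per-claim logistic regression with the three predictors the figure names (transcript similarity, noun overlap, and word count). The data are synthetic and the feature construction is assumed, not the authors' pipeline; the generating coefficients are merely chosen to mirror the reported direction of effects (similarity and overlap protective, word count a risk factor).

```python
# Hypothetical sketch of the per-claim regression summarized in Figure 5.
# Synthetic data; feature names mirror the figure, not the paper's code.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# Illustrative per-claim features: semantic similarity of the claim to the
# video transcript, noun overlap with the source text, and claim length.
transcript_sim = rng.uniform(0.0, 1.0, n)
noun_overlap = rng.uniform(0.0, 1.0, n)
word_count = rng.integers(5, 60, n).astype(float)

# Generate SUPPORTED labels so that similarity and overlap protect while
# length hurts, matching the sign pattern reported in the paper.
logits = -1.0 + 3.0 * transcript_sim + 2.0 * noun_overlap - 0.03 * word_count
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

X = sm.add_constant(np.column_stack([transcript_sim, noun_overlap, word_count]))
res = sm.Logit(y, X).fit(disp=0)

names = ["const", "transcript_similarity", "noun_overlap", "word_count"]
for name, beta, p in zip(names, res.params, res.pvalues):
    print(f"{name:>22}: beta = {beta:+.2f}, p = {p:.3g}")
```

In the paper, models of this form are fit separately per domain, once per judge, and Figure 5 plots the mean coefficient across the three judges with averaged 95% confidence intervals.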