FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang

TL;DR

This work presents a contamination-aware, moderate-scale evaluation of large reasoning models (LRMs) across textual and visual tasks, introducing the Reasoning-Oriented Multimodal Evaluation (ROME) benchmark for vision-language models. It combines automatic prompt verification with LLM-assisted analysis of reasoning traces to study macro-behavioral properties, token efficiency, and safety, highlighting persistent misalignment signals, overconfidence, and hallucination of external tool use. Across textual and visual domains, the study finds that test-time thinking yields mixed or limited gains on many tasks, with some models (e.g., GPT-5 series, Gemini 2.5 Pro) achieving strong performance in select categories while others show substantial variability and saturation effects. The paper advocates greater transparency, more consistent reasoning, improved visual perception, and broader, creative benchmarking to guide future development and responsible deployment of LRMs in real-world settings.

Abstract

We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/

Paper Structure

This paper contains 66 sections, 41 figures, 22 tables.

Figures (41)

  • Figure 1: Scatter plots of mean±std on overall averaged accuracy scores and token consumption for textual (left) and visual (right) problems, with an outlier (Qwen3-Next-thinking, taking around 30k tokens on average) omitted in the left figure. Aggregated overall metrics could be misleading if you don't know how they are formed. The breakdown sections and plots for subcategories in the appendix are worth more attention.
  • Figure 2: Claude Sonnet 4 on a game theory problem: The analysis contained two principal errors. The first was methodological: the use of a non-standard cost model ($x (t + 1)$) that does not account for baseline travel time inherent in network edges. The second was a logical flaw in the conclusion: despite calculations showing the equilibrium cost ratio to be a function of a parameter $t$, a single numerical answer was presented based on an unsubstantiated choice for $t$, contradicting the preceding mathematical proof.
  • Figure 3: Inconsistent answers between the reasoning and the response from Gemini 2.5 Flash: the reasoning process repeatedly concludes that the answer should be "Mango" and "Walrus", but the actual response gives a different answer, "Worm", which never appears in the reasoning summary.
  • Figure 4: Gemini 2.5 Pro reaches a correct answer that its reasoning does not support: the reasoning process points to a very different answer and never once mentions the actual final answer in the reasoning summary. The reasoning trace also claims that a program was written, yet it still gives an invalid pair.
  • Figure 5: Claude Sonnet 4 answering a simple factoid question on long-tailed knowledge: it gives a deterministic (but false) answer without thinking, but abstains when thinking is enabled.
  • ...and 36 more figures