FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
TL;DR
This work presents a contamination-aware, moderate-scale evaluation of large reasoning models (LRMs) across textual and visual tasks and introduces the Reasoning-Oriented Multimodal Evaluation (ROME) benchmark for vision-language models. It combines automatically verifiable questions with LLM-assisted analysis of reasoning traces to study macro-behavioral properties, token efficiency, and safety, highlighting persistent misalignment signals, overconfidence, and hallucinated external tool use. Across textual and visual domains, the study finds that test-time thinking yields mixed or limited gains on many tasks: some models (e.g., GPT-5 series, Gemini 2.5 Pro) achieve strong performance in select categories, while others show substantial variability and saturation effects. The paper advocates greater transparency, more consistent reasoning, improved visual perception, and broader, more creative benchmarking to guide future development and responsible deployment of LRMs in real-world settings.
Abstract
We conduct a moderate-scale, contamination-free (to some extent) evaluation of current large reasoning models (LRMs) and report some preliminary findings. We also release ROME, our evaluation benchmark for vision-language models, intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and other updates are available on this website: https://flageval-baai.github.io/LRM-Eval/
