EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery
Gui Wang, Yang Wennuo, Xusen Ma, Zehao Zhong, Zhuoru Wu, Ende Wu, Rong Qu, Wooi Ping Cheah, Jianfeng Ren, Linlin Shen
TL;DR
EyePCR tackles the lack of domain-specific benchmarks for surgical cognition by introducing a three-stage PCR framework (Perception, Comprehension, Reasoning) for ophthalmic surgery, underpinned by a large-scale, knowledge-grounded VQA corpus. It couples fine-grained, multi-view perception with a knowledge-anchored scene-graph comprehension layer and four clinically oriented reasoning tasks, all supported by a substantial knowledge graph and reasoning paths. The dataset comprises over 210K VQAs across 82K video segments and demonstrates how domain adaptation (EyePCR-MLLM) improves perception and brings model performance closer to expert clinicians and elite commercial systems. The work highlights significant gaps in current MLLMs for surgical cognition and establishes EyePCR as a foundational benchmark for developing reliable, interpretable, and knowledge-consistent surgical video understanding models.
Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities, but their performance in high-stakes, domain-specific scenarios such as surgical settings remains largely under-explored. To address this gap, we develop \textbf{EyePCR}, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge, to evaluate cognition across \textit{Perception}, \textit{Comprehension} and \textit{Reasoning}. EyePCR offers a richly annotated corpus with more than 210k VQAs, covering 1048 fine-grained attributes for multi-view perception, a medical knowledge graph of more than 25k triplets for comprehension, and four clinically grounded reasoning tasks. The rich annotations facilitate in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions, thus greatly improving models' cognitive ability. In particular, \textbf{EyePCR-MLLM}, a domain-adapted variant of Qwen2.5-VL-7B, achieves the highest accuracy on MCQs for \textit{Perception} among compared models and outperforms open-source models in \textit{Comprehension} and \textit{Reasoning}, rivalling commercial models such as GPT-4.1. EyePCR reveals the limitations of existing MLLMs in surgical cognition and lays the foundation for benchmarking and enhancing the clinical reliability of surgical video understanding models.
