PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna

Abstract

We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires combining multiple temporally separated pieces of visual evidence under conjunctive and sequential compositional constraints. Questions span perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and require skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains, including city walk tours, indoor villa tours, video games, and extreme outdoor sports, all annotated entirely by hand. Human studies show that PerceptionComp demands substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.

Paper Structure

This paper contains 30 sections, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Overview of the PerceptionComp benchmark. (a) An example from PerceptionComp, where models are required to perform complex, perception-centric reasoning with various types of subconditions to arrive at the final answer. (b) Results from a human study measuring question-answering time, showing that PerceptionComp is more challenging for humans than previous perception and reasoning video benchmarks, largely due to its emphasis on perception-centric reasoning.
  • Figure 2: Data construction and statistics of PerceptionComp. (a) Annotation pipeline, which integrates diverse subconditions and supports two types of compositional questions. (b) Benchmark statistics: higher difficulty levels contain more subconditions, increasing the demand for perception-centric reasoning.
  • Figure 3: Examples from PerceptionComp. Difficulty markers denote levels 1, 2, and 3, respectively. PerceptionComp spans diverse video sources and uses subconditions to construct conjunctive and sequential questions that require perception-centric reasoning.
  • Figure 4: Analysis on PerceptionComp. Accuracy as a function of perception budget and reasoning budget. Left/middle: accuracy vs. the number of uniformly sampled input frames for GPT-o3 and Qwen3-VL-8B. Right: accuracy vs. the thinking-token budget for Gemini-2.5-Flash.
  • Figure 5: Example of model reasoning on PerceptionComp. We show responses and judgments of frontier models. Even state-of-the-art models exhibit limitations in capturing perceptual information and often fail to maintain coherent reasoning chains leading to the correct answer.
  • ...and 6 more figures