SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, Chuang Gan

TL;DR

This work presents SOK-Bench, a benchmark for situated open-world commonsense reasoning in videos, addressing the gap where prior datasets rely on static or crowd-sourced knowledge. It introduces an automatic, scalable generation pipeline that builds three aligned knowledge graphs—Situated Knowledge Graph (SKG), General Knowledge Graph (GKG), and Situated Commonsense Knowledge Graph (SCKG)—to produce 44K QA pairs across 12 question types from 10K video clips, each with rationales. The generation relies on iterative LLM/MLLM interactions, Few-Shot Self-Prompting, and careful quality validation, enabling robust bottom-up and top-down QA generation. Experiments with mainstream vision-language models reveal that current systems struggle with the open-world, multi-hop reasoning required by SOK-Bench, highlighting a clear direction for advancing multimodal commonsense reasoning in dynamic real-world contexts.

Abstract

Learning commonsense reasoning from visual contexts and scenes in the real world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge in the real world. Our work aims to delve deeper into reasoning evaluations, specifically within dynamic, open-world, and structured context knowledge. We propose a new benchmark (SOK-Bench), consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos. The reasoning process requires understanding and applying both situated knowledge and general knowledge for problem-solving. To create such a dataset, we propose an automatic and scalable generation method that produces question-answer pairs, knowledge graphs, and rationales by instructing combinations of LLMs and MLLMs. Concretely, we first extract observable situated entities, relations, and processes from videos for situated knowledge and then extend to open-world knowledge beyond the visible content. The task generation proceeds through multiple dialogues as iterations, and the outputs are subsequently corrected and refined by our designed self-prompting and demonstrations. With a corpus of both explicit situated facts and implicit commonsense, we generate associated question-answer pairs and reasoning processes, followed by manual reviews for quality assurance. We evaluated recent mainstream large vision-language models on the benchmark and drew several insightful conclusions. For more information, please refer to our benchmark at www.bobbywu.com/SOKBench.

Paper Structure (18 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview. Instead of relying on crowd-sourcing, we design a synthesis pipeline that creates the benchmark by leveraging LLMs and VLLMs, improving efficiency and ensuring consistency. The method automatically generates high-quality question-answer pairs (QAs) that target the desired purposes for evaluating a model's abilities. To generate data aligned with open-world knowledge, we connect situations, general knowledge, and situated commonsense, producing three types of associated knowledge graphs (see the subsections on the Situated Knowledge Graph, the General Knowledge Graph, and the Situated Commonsense Knowledge Graph). Specifically, the pipeline makes more precise inferences from situational facts and essential commonsense knowledge by aligning with bottom-up or top-down goals, so the reasoning process from Q to A can be demonstrated explicitly.
  • Figure 2: SOK-Bench data examples. Each QA pair corresponds to a video clip (e.g., a video clip showing how to cook California Rolls) and a type of situated commonsense knowledge (e.g., action + temporal + purpose). For each question, we provide four options, the correct choice, and the associated situated commonsense graphs.
  • Figure 3: (a) Sankey diagram of the 12 question types. (b) Answer distribution among options for each question type. Meaning of abbreviations: O: Object; A: Action; CT: Counterfactual; CB: Contribution; PU: Purpose; I: Inference; PO: Possibility; ST: Spatiotemporal; GK: General knowledge. Notably, "Spatiotemporal" includes "obj attributes", "obj-obj relations", "obj attribute + obj-obj relation", and "before/after action" (see the section on QA generation).
  • Figure 4: Generation pipeline of the Situated Knowledge Graph (SKG), General Knowledge Graph (GKG), and Situated Commonsense Knowledge Graph (SCKG), each described in the corresponding method subsection.
  • Figure 5: GPT4v's ability to perform complex combined spatiotemporal and situated commonsense reasoning is limited. The model needs to do two-hop reasoning, i.e., understanding that the purpose of "adding chopped onions" is to "enhance the flavor" while knowing that the previous action is "pouring oil".
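
The two-hop reasoning described in the Figure 5 caption can be illustrated with a minimal Python sketch. It is not the benchmark's implementation: the dictionary layout, the names `NEXT_ACTION` and `PURPOSE`, the function `purpose_of_action_after`, and the exact question form are all assumptions made for illustration. Only the three phrases ("pouring oil", "adding chopped onions", "enhance the flavor") come from the caption.

```python
from typing import Optional

# Hypothetical toy graphs. The layout is an illustrative assumption,
# not the actual SOK-Bench schema.

# Situated (spatiotemporal) knowledge: which action follows which in the clip.
NEXT_ACTION = {"pouring oil": "adding chopped onions"}

# Situated commonsense knowledge: why an action is performed in this situation.
PURPOSE = {"adding chopped onions": "enhance the flavor"}


def purpose_of_action_after(previous_action: str) -> Optional[str]:
    """Answer 'what is the purpose of the action that follows X?' in two hops."""
    # Hop 1 (spatiotemporal): find the action that follows the given one.
    action = NEXT_ACTION.get(previous_action)
    if action is None:
        return None
    # Hop 2 (situated commonsense): look up why that action is performed.
    return PURPOSE.get(action)


print(purpose_of_action_after("pouring oil"))
```

A model that resolves only one hop, for example identifying the next action without knowing its purpose, cannot answer this kind of question, which is the difficulty the caption points to.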