Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Basura Fernando, Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat

TL;DR

Knowledge Module Learning (KML) is a neurosymbolic framework for procedural knowledge reasoning in videos. It learns relation-specific neural modules for the relations of a Procedural Knowledge Graph (PKG) and composes them into executable programs generated by large language models. The approach decouples program synthesis from execution and grounds module behavior in the PKG through explicit relations such as HAS_TOOL and HAS_PURPOSE, enabling interpretable intermediate states and uncertainty-aware multi-hop reasoning. Theoretical results establish a separation condition for learned mappings and a deterministic bound on error accumulation across hops, providing stability guarantees for multi-step reasoning. Empirically, KML outperforms LLM-only and black-box baselines on the PKR-QA benchmark, with ablations and robustness analyses demonstrating the benefits of procedural grounding, LLM-generated programs, and learned knowledge modules (KMs); code is publicly available for reproducibility. The work also extends to logical operators such as AND and NOT and discusses future directions toward richer logic and embodied reasoning.

Abstract

In this work we present Knowledge Module Learning (KML) for understanding and reasoning over procedural tasks, which requires models to learn structured and compositional procedural knowledge. KML is a neurosymbolic framework that learns relation categories within a knowledge graph as neural knowledge modules and composes them into executable reasoning programs generated by large language models (LLMs). Each module encodes a specific procedural relation, capturing how entity types relate to one another: for example, how tools relate to steps, the purpose of each tool, and the steps of each task. Given a question conditioned on a task shown in a video, KML performs multistep reasoning with transparent, traceable intermediate states. Our theoretical analysis demonstrates two desirable properties of KML. First, KML satisfies strong optimality conditions for modelling KG relations as neural mappings, providing strong foundations for generalizable procedural reasoning. Second, the analysis bounds the expected error when KML performs multistep reasoning. To evaluate this model, we construct a large procedural knowledge graph (PKG) covering diverse instructional domains by integrating the COIN instructional video dataset, the COIN ontology, commonsense relations from ConceptNet, and structured extractions from LLMs, followed by expert verification. We then generate question-answer pairs by applying graph traversal templates over the PKG, constructing the PKR-QA benchmark for procedural knowledge reasoning. Experiments show that KML improves structured reasoning performance while providing interpretable step-by-step traces, outperforming LLM-only and black-box neural baselines. Code is publicly available at https://github.com/LUNAProject22/KML.
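
The composition of relation modules into a multi-hop program can be illustrated with a minimal sketch. This is not the authors' implementation: fixed lookup tables stand in for the learned neural modules, and the relation name NEXT_STEP, the helper functions, and every entity and probability except the "wrap the pipe band" step and the "wrench" tool are hypothetical values chosen only for illustration.

```python
from typing import Dict, List, Tuple

Distribution = Dict[str, float]

# Hypothetical toy "modules": each relation maps a source entity to a
# distribution over target entities. In KML these would be learned neural
# mappings; fixed lookup tables stand in for them here.
MODULES: Dict[str, Dict[str, Distribution]] = {
    "NEXT_STEP": {
        "wrap the pipe band": {"tighten the fitting": 0.9, "turn on the water": 0.1},
    },
    "HAS_TOOL": {
        "tighten the fitting": {"wrench": 0.8, "pliers": 0.2},
        "turn on the water": {"hand": 1.0},
    },
}


def apply_module(relation: str, state: Distribution) -> Distribution:
    """Push a distribution over entities through one relation module."""
    out: Distribution = {}
    for source, p_source in state.items():
        for target, p_target in MODULES[relation].get(source, {}).items():
            out[target] = out.get(target, 0.0) + p_source * p_target
    return out


def run_program(
    program: List[str], grounded: Distribution
) -> Tuple[Distribution, List[Distribution]]:
    """Execute relation modules in order, keeping every intermediate state."""
    trace = [grounded]
    state = grounded
    for relation in program:
        state = apply_module(relation, state)
        trace.append(state)
    return state, trace


# Entity grounded in the video (assumed to be given; a VLM would supply it).
grounded = {"wrap the pipe band": 1.0}
# Assumed program for a question such as "Which tool is needed in the next step?"
answer, trace = run_program(["NEXT_STEP", "HAS_TOOL"], grounded)
best = max(answer, key=answer.get)
print(best)
```

Keeping every intermediate distribution in `trace` is what makes the reasoning inspectable hop by hop, and carrying a full distribution rather than a single entity lets uncertainty from the first hop propagate into the final answer.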

Paper Structure

This paper contains 50 sections, 5 theorems, 52 equations, 10 figures, 10 tables.

Key Result

Lemma 1

If $\mathcal{L}(x) \le \varepsilon$, then

Figures (10)

  • Figure 1: (Left) Schema of the Procedural Knowledge Graph (PKG) showing the high-level abstraction of PKG. In the middle are examples of Traversal Templates that define reasoning patterns over PKG to generate question-answer pairs. Corresponding example questions are shown on the right. In the traversal templates, blue text indicates information grounded in the input video, while red text denotes the target answer node.
  • Figure 2: Knowledge Module Learning (KML) for knowledge graph-based procedural video question answering. Given a video and a question, the framework first identifies the entity type in PKG (e.g. "step") that should be detected in the video and then recognizes the entity type instances (e.g., step: "wrap the pipe band") using a VLM. Based on the question and grounded entity type, the answer module generator produces the module sequence. Finally, KML executes the sequence of Knowledge Modules to answer the question. In this example, it first identifies the next step and then determines the tool required for that step, which is a "wrench".
  • Figure 3: Example of logical programs generated by LLM.
  • Figure 4: (Left) QA performance of KML-F-CLIP using top-1 to top-5 grounded entities from P.VRL. (Right) Correlation between step prediction accuracy and QA accuracy.
  • Figure 5: Evaluating the impact of hidden dimension size (left) and the activation function (right) of the Knowledge modules on the validation set.
  • ...and 5 more figures

Theorems & Definitions (8)

  • Lemma 1: Loss bound implies relative mass constraint
  • proof
  • Lemma 2: Bounds on extreme similarities
  • proof
  • Theorem 1: Sufficient condition for KML separation
  • proof
  • Lemma 3: One-step error recursion under Lipschitz learned module
  • Theorem 2: Deterministic composition error bound over a $T$-hop traversal
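
The exact statements of Lemma 3 and Theorem 2 are not reproduced in this listing. As a hedged illustration only, a generic recursion of the kind these titles suggest can be unrolled as follows. Here $e_t$ (the error after hop $t$), $L$ (a Lipschitz constant of the learned module), and $\delta$ (a per-hop approximation error) are assumed symbols, not the paper's notation:

```latex
e_{t+1} \le L\, e_t + \delta
\quad\Longrightarrow\quad
e_T \le L^{T} e_0 + \delta \sum_{k=0}^{T-1} L^{k}
```

Under these assumptions, when $L < 1$ the geometric sum is at most $1/(1-L)$, so the accumulated error stays bounded by $L^{T} e_0 + \delta/(1-L)$ however many hops are taken. When $L > 1$ the same bound grows geometrically with $T$.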