Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal

Shenxi Liu, Kan Li, Mingyang Zhao, Yuhang Tian, Shoujun Zhou, Bin Li

TL;DR

Med-CRAFT introduces a neuro-symbolic pipeline that converts raw medical videos into a dynamic spatiotemporal knowledge graph and uses deterministic graph traversal to synthesize multi-hop, logic-grounded benchmarks. By grounding queries in verifiable graph paths and validating them with adversarial models, it mitigates hallucinations common in end-to-end generation. The instantiated M3-Med-Auto dataset demonstrates that KG-guided workloads achieve expert-level complexity with scalable production and full provenance. This approach offers scalable, low-cost, and interpretable evaluation protocols for medical video understanding systems.

Abstract

The scarcity of high-quality, logically annotated video datasets remains a primary bottleneck in advancing Multi-Modal Large Language Models (MLLMs) for the medical domain. Traditional manual annotation is prohibitively expensive and non-scalable, while existing synthetic methods often suffer from stochastic hallucinations and a lack of logical interpretability. To address these challenges, we introduce Med-CRAFT, a novel neuro-symbolic data engineering framework that formalizes benchmark synthesis as a deterministic graph traversal process. Unlike black-box generative approaches, Med-CRAFT extracts structured visual primitives (e.g., surgical instruments, anatomical boundaries) from raw video streams and instantiates them into a dynamic Spatiotemporal Knowledge Graph. By anchoring query generation to valid paths within this graph, we enforce a rigorous Chain-of-Thought (CoT) provenance for every synthesized benchmark item. We instantiate this pipeline to produce M3-Med-Auto, a large-scale medical video reasoning benchmark exhibiting fine-grained temporal selectivity and multi-hop logical complexity. Comprehensive evaluations demonstrate that our automated pipeline generates query workloads with complexity comparable to expert-curated datasets. Furthermore, a logic alignment analysis reveals a high correlation between the prescribed graph topology and the reasoning steps of state-of-the-art MLLMs, validating the system's capability to encode verifiable logic into visual-linguistic benchmarks. This work paves the way for scalable, low-cost construction of robust evaluation protocols in critical domains.
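
The idea of anchoring each query to a valid graph path can be illustrated with a minimal sketch. This is not the authors' implementation: the edge schema, the toy entities and timestamps, and the helper names (`build_adjacency`, `enumerate_paths`, `path_to_item`) are all assumptions made for illustration. It shows only how a fixed-order traversal yields a multi-hop question whose answer and reasoning chain are read directly off the path.

```python
from collections import defaultdict

# Hypothetical toy graph; each edge is (subject, relation, object, start_s, end_s).
# Entities, relations, and timestamps are invented for illustration.
EDGES = [
    ("grasper", "holds", "gallbladder", 10.0, 25.0),
    ("gallbladder", "attached_to", "cystic_duct", 0.0, 60.0),
    ("clip_applier", "clips", "cystic_duct", 30.0, 34.0),
]


def build_adjacency(edges):
    """Index edges by subject and sort them so traversal order is fixed."""
    adj = defaultdict(list)
    for edge in edges:
        adj[edge[0]].append(edge)
    for node in adj:
        adj[node].sort(key=lambda e: (e[1], e[2], e[3]))
    return adj


def enumerate_paths(adj, start, hops):
    """Depth-first enumeration of simple paths with exactly `hops` edges."""
    results = []

    def dfs(node, path, visited):
        if len(path) == hops:
            results.append(tuple(path))
            return
        for edge in adj.get(node, []):
            nxt = edge[2]
            if nxt in visited:  # skip cycles so each path is simple
                continue
            dfs(nxt, path + [edge], visited | {nxt})

    dfs(start, [], {start})
    return results


def path_to_item(path):
    """Turn one path into a QA item; the path itself is the provenance."""
    relations = " -> ".join(edge[1] for edge in path)
    chain = [
        f"{s} {r.replace('_', ' ')} {o} [{t0:.0f}s-{t1:.0f}s]"
        for s, r, o, t0, t1 in path
    ]
    return {
        "question": f"Starting from the {path[0][0]}, which entity is "
                    f"reached by following: {relations}?",
        "answer": path[-1][2],
        "provenance": chain,
        "hops": len(path),
    }


if __name__ == "__main__":
    adjacency = build_adjacency(EDGES)
    for p in enumerate_paths(adjacency, "grasper", hops=2):
        print(path_to_item(p))
```

Because the adjacency lists are sorted and no sampling is involved, the same graph always produces the same items in the same order, which is one plausible reading of "deterministic" in the abstract. A real pipeline would presumably add natural-language templating and the validation step mentioned in the TL;DR.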

Paper Structure

This paper contains 14 sections, 10 equations, 7 figures.

Figures (7)

  • Figure 1: The basic structure of Med-CRAFT, comprising a visual extraction layer (pixel-level), a graph construction layer (semantic-level), and a query synthesis layer (logic-level).
  • Figure 2: Visual Primitive Extraction: Segmenting and tracking surgical instruments and anatomical structures to form spatiotemporal tubelets.
  • Figure 3: Dynamic Knowledge Graph Construction: Instantiating a symbolic graph where nodes represent visual entities and edges encode temporal interactions.
  • Figure 4: Logic-Guided Query Synthesis: Generating complex, multi-hop QA pairs via deterministic graph traversal, ensuring strict alignment between visual evidence and textual logic.
  • Figure 5: Results for Task 1, comparing our retrieval performance against state-of-the-art baselines and control groups. Our method significantly outperforms existing datasets on challenging tasks, especially those requiring precise temporal alignment.
  • ...and 2 more figures
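
The captions for Figures 2 and 3 describe tubelets being turned into graph nodes, with edges encoding temporal interactions. A minimal sketch of that step follows, under stated assumptions: the `Tubelet` fields, the relation names `co_occurs_with` and `precedes`, and the `min_overlap_s` threshold are hypothetical and do not come from the paper. Spatial information such as masks or bounding boxes is omitted for brevity.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Tubelet:
    """A tracked entity with its visible time span (spatial masks omitted)."""
    entity: str
    kind: str  # assumed categories: "instrument" or "anatomy"
    start_s: float
    end_s: float


def overlap(a, b):
    """Return the shared (start, end) window of two tubelets, or None."""
    start = max(a.start_s, b.start_s)
    end = min(a.end_s, b.end_s)
    return (start, end) if end > start else None


def temporal_edges(tubelets, min_overlap_s=1.0):
    """Derive temporal edges between every pair of tubelets.

    Pairs visible together for at least `min_overlap_s` seconds get a
    `co_occurs_with` edge; disjoint pairs get a `precedes` edge from the
    earlier tubelet to the later one. Relation names are assumptions.
    """
    edges = []
    ordered = sorted(tubelets, key=lambda t: (t.start_s, t.entity))
    for a, b in combinations(ordered, 2):
        window = overlap(a, b)
        if window and window[1] - window[0] >= min_overlap_s:
            edges.append((a.entity, "co_occurs_with", b.entity, window))
        elif a.end_s <= b.start_s:
            edges.append((a.entity, "precedes", b.entity, (a.end_s, b.start_s)))
    return edges


if __name__ == "__main__":
    tracks = [
        Tubelet("grasper", "instrument", 10.0, 25.0),
        Tubelet("gallbladder", "anatomy", 0.0, 60.0),
        Tubelet("clip_applier", "instrument", 30.0, 34.0),
    ]
    for edge in temporal_edges(tracks):
        print(edge)
```

Sorting the tubelets before pairing keeps the edge list reproducible across runs, so any downstream traversal sees the same graph for the same video.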