Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal

Shenxi Liu; Kan Li; Mingyang Zhao; Yuhang Tian; Shoujun Zhou; Bin Li

Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal

Shenxi Liu, Kan Li, Mingyang Zhao, Yuhang Tian, Shoujun Zhou, Bin Li

TL;DR

Med-CRAFT introduces a neuro-symbolic pipeline that converts raw medical videos into a dynamic spatiotemporal knowledge graph and uses deterministic graph traversal to synthesize multi-hop, logic-grounded benchmarks. By grounding queries in verifiable graph paths and validating them with adversarial models, it mitigates hallucinations common in end-to-end generation. The instantiated M3-Med-Auto dataset demonstrates that KG-guided workloads achieve expert-level complexity with scalable production and full provenance. This approach offers scalable, low-cost, and interpretable evaluation protocols for medical video understanding systems.

Abstract

The scarcity of high-quality, logically annotated video datasets remains a primary bottleneck in advancing Multi-Modal Large Language Models (MLLMs) for the medical domain. Traditional manual annotation is prohibitively expensive and non-scalable, while existing synthetic methods often suffer from stochastic hallucinations and a lack of logical interpretability. To address these challenges, we introduce \textbf{\PipelineName}, a novel neuro-symbolic data engineering framework that formalizes benchmark synthesis as a deterministic graph traversal process. Unlike black-box generative approaches, Med-CRAFT extracts structured visual primitives (e.g., surgical instruments, anatomical boundaries) from raw video streams and instantiates them into a dynamic Spatiotemporal Knowledge Graph. By anchoring query generation to valid paths within this graph, we enforce a rigorous Chain-of-Thought (CoT) provenance for every synthesized benchmark item. We instantiate this pipeline to produce M3-Med-Auto, a large-scale medical video reasoning benchmark exhibiting fine-grained temporal selectivity and multi-hop logical complexity. Comprehensive evaluations demonstrate that our automated pipeline generates query workloads with complexity comparable to expert-curated datasets. Furthermore, a logic alignment analysis reveals a high correlation between the prescribed graph topology and the reasoning steps of state-of-the-art MLLMs, validating the system's capability to encode verifiable logic into visual-linguistic benchmarks. This work paves the way for scalable, low-cost construction of robust evaluation protocols in critical domains.

Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal

TL;DR

Abstract

Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (7)