RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

Sen Zhang, Runmei Li, Zhichao Zheng, Yuhe Zhang, Jiani Li, Kailun Zhang, Tao Zhang, Wenjun Wu, Qunbo Wang

Abstract

Automatic Train Operation (ATO) relies on low-latency, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, reduces inference latency, and strengthens cross-domain generalization, while enabling plug-and-play deployment in autonomous driving systems. Code and datasets will be available at https://github.com/Cybereye-bjtu/RailVQA.

Paper Structure

This paper contains 23 sections, 7 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Overview of the proposed RailVQA-CoM framework, which consists of three hierarchical modules: (1) a Perception module that efficiently extracts visual primitives; (2) a Motion Analysis and Memory Log module that captures motion patterns and maintains temporal context for dynamic scenes; and (3) a Cognitive Inference module that performs high-level reasoning and supports decision-oriented inference.
  • Figure 2: Standardized input–output schemas for the benchmark’s two core subtasks: Static Single-frame VQA and Dynamic Multi-frame VQA. Given a visual input—either a single frame $I$ or a video sequence $S$—and its associated question $Q$, the model output is required to follow a predefined, structured chain-of-thought (CoT) format.
  • Figure 3: Comprehensive statistical overview of the RailVQA-bench dataset. (a) shows the distribution of generated CoT character lengths, reflecting the benchmark’s emphasis on logic-intensive reasoning. (b) reports the occurrence frequencies of key railway entities, demonstrating broad domain-specific coverage. (c) summarizes the distribution of question intents, indicating a primary focus on action planning and safety-related decisions rather than simple perception.
  • Figure 4: Performance-efficiency comparison in dynamic scenarios. RailVQA-CoM achieves simultaneous substantial gains, pushing the performance towards the top-right by reducing latency and enhancing cognitive scores.
  • Figure 5: Ablation study of the core middleware components of RailVQA-CoM. To show each module’s contribution at a glance, three representative metrics (Overall Score, Risk Assessment, and Physics & Momentum) are extracted from the ablation table to construct the bar chart.
  • ...and 1 more figure
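The Figure 2 caption notes that model outputs must follow a predefined, structured CoT format, which implies that outputs can be mechanically validated and scored. As a minimal sketch of how such a schema might be serialized and checked, the snippet below uses tagged sections; the tag names (`PERCEPTION`/`REASONING`/`DECISION`) and helper functions are illustrative assumptions, not the benchmark's actual specification.

```python
import re
from dataclasses import dataclass

# Hypothetical three-part CoT answer: perceive -> reason -> decide.
# Field names are assumptions for illustration, not the RailVQA schema.
@dataclass
class CoTAnswer:
    perception: str  # visual primitives observed in the frame I or sequence S
    reasoning: str   # step-by-step inference over those observations
    decision: str    # final operational answer to the question Q

def format_cot(answer: CoTAnswer) -> str:
    """Serialize a CoT answer into a fixed tagged template."""
    return (
        f"<PERCEPTION>{answer.perception}</PERCEPTION>"
        f"<REASONING>{answer.reasoning}</REASONING>"
        f"<DECISION>{answer.decision}</DECISION>"
    )

def parse_cot(text: str) -> CoTAnswer:
    """Parse a model output against the template, raising if a section is missing."""
    fields = {}
    for tag in ("PERCEPTION", "REASONING", "DECISION"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if m is None:
            raise ValueError(f"missing <{tag}> section")
        fields[tag.lower()] = m.group(1)
    return CoTAnswer(**fields)
```

A fixed template like this makes hallucinated or malformed outputs detectable before any downstream planning logic consumes them, which is one simple way the benchmark's interpretability requirement could be enforced.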