Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

Kaixin Zhang, Xiaohe Li, Jiahao Li, Haohua Wu, Xinyu Zhao, Zide Fan, Lei Wang

Abstract

Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging because it demands temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, which causes severe hallucinations and poor interpretability. Existing methods also leave three core gaps unaddressed: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework trained with a two-stage supervised fine-tuning paradigm that requires no extensive modification of the base model. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning; lightweight modules keep inference efficient. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by at least 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.

Paper Structure (22 sections, 8 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: Overall framework of the proposed ClueNet, which mimics the hierarchical human visual cognitive process to enable explicit clue-aware video reasoning.
  • Figure 2: MLLM attention heatmaps and paired VideoQA results: misaligned attention strongly correlates with incorrect answers, confirming visual perception bias.
  • Figure 3: Visualization of top-probability output tokens from intermediate layers: erroneous semantic interpretation of visual clues in intermediate layers propagates to hallucinated reasoning and incorrect final answers, confirming clue cognition bias.
  • Figure 4: t-SNE visualization of features for the question, correct and incorrect candidate answers, reasoning clues, and sampled visual tokens: clue-answer feature proximity, causal reasoning utility, and the performance boost from ground-truth clues confirm inductive reasoning bias.
  • Figure 5: Overview of our two-stage training scheme.
  • ...and 2 more figures