Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

Kaixin Zhang; Xiaohe Li; Jiahao Li; Haohua Wu; Xinyu Zhao; Zide Fan; Lei Wang

Abstract

Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging because it demands temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, which causes severe hallucinations and poor interpretability. Existing methods also leave three core gaps unaddressed: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework trained with a two-stage supervised fine-tuning paradigm that requires no extensive modification of the base model. In the first stage, decoupled supervision aligns clue extraction and chain-based reasoning; in the second, inference supervision with an adaptive clue filter refines high-order reasoning. Lightweight modules keep inference efficient. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.

Paper Structure (22 sections, 8 equations, 7 figures, 4 tables)

Introduction
Related Work
Question Analysis
Visual Perception Bias.
Clue Cognition Bias.
Inductive Reasoning Bias.
Method
Framework Overview
Clue Cognizer
Stage 1: Decoupled Supervision of Clue Discovery and Reasoning
Stage 2: Adaptive Clue Filter and Inference Supervision
Inference Process
Theoretical Analysis
Experiments
Experimental Setup
...and 7 more sections

Figures (7)

Figure 1: Overall framework of the proposed ClueNet, which mimics the hierarchical human visual cognitive process to enable explicit clue-aware video reasoning.
Figure 2: MLLM attention heatmaps and paired VideoQA results: misaligned attention strongly correlates with incorrect answers, confirming visual perception bias.
Figure 3: Intermediate layers top-probability output token visualization: erroneous visual clue semantic interpretation in intermediate layers propagates to hallucinated reasoning and incorrect final answers, confirming clue cognition bias.
Figure 4: t-SNE visualization of question, correct/incorrect candidate answer, reasoning clues, and sampled visual token features: clue-answer feature proximity, causal reasoning utility, and ground-truth clue performance boost confirm inductive reasoning bias.
Figure 5: Overview of our two-stage training scheme.
...and 2 more figures
