iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich

TL;DR

This work tackles the challenge of grounding general-purpose LLMs for dash-cam driving video analysis by decoupling perception from reasoning and supplying the LLM with structured, domain-specific cues. It introduces iFinder, a modular, training-free pipeline that extracts object poses, lane contexts, distances, and 3D orientations using pretrained vision models, then feeds a hierarchical JSON representation and a three-block prompt into a capable LLM, with a peer V-VLM providing an initial hypothesis for refinement. Across four public benchmarks in zero-shot settings, iFinder outperforms both generalist and driving-specialized V-VLMs, with large gains in accident reasoning and robustness under adverse conditions. The study highlights the importance of explicit symbolic cues, ego-centric context, and hierarchical grounding for reliable, interpretable post-hoc driving video understanding, offering a practical alternative to end-to-end V-VLMs in safety-critical analysis.

Abstract

Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (e.g., no LiDAR or GPS), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues -- object pose, lane positions, and object trajectories -- which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, this structure enables step-wise, grounded reasoning in which the LLM refines a peer V-VLM's outputs. Evaluations on four public dash-cam video benchmarks in the zero-shot setting show that iFinder's grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs, with gains of up to 39% in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.

Paper Structure

This paper contains 42 sections, 4 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: Advantages of $\bm{\dot{\imath}}$Finder. Baselines VideoLLaMA2 [damonlpsg2024videollama2], VideoLLaVA [lin2023video], and DriveMM [huang2024drivemm] struggle with spatial reasoning and fine-grained scene understanding, misinterpreting critical cues. $\bm{\dot{\imath}}$Finder's structured scene approach mitigates these errors for more accurate responses.
  • Figure 2: $\bm{\dot{\imath}}$Finder overview. The proposed pipeline transforms key scene properties such as object detection, lane detection, depth estimation, and ego-state estimation, into structured data, which, combined with peer-generated insights, enables the LLM to perform accurate and interpretable driving scenario analysis.
  • Figure 3: Lane location estimation. Detected objects are assigned a lane by mapping the bottom midpoint of the corresponding bounding box (bottom middle point of image for ego) to sections identified by the lane detection model.
  • Figure 4: Distance estimation. Each object's distance is determined by averaging the depth values within its segmented region.
  • Figure 5: Qualitative comparison on the LingoQA (top), MM-AU (bottom left), and Nexar (bottom right) datasets. $\bm{\dot{\imath}}$Finder improves spatial reasoning and causal inference, and reduces peer V-VLM errors. In the bottom-left example, $\bm{\dot{\imath}}$Finder corrects the peer V-VLM's inaccurate claim of a "decelerated vehicle" by leveraging structured data that reveals the ego vehicle's rapid approach.
  • ...and 4 more figures
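
The captions of Figures 3 and 4 describe two geometric steps: assigning an object to a lane from the bottom midpoint of its bounding box, and estimating its distance by averaging depth values inside its segmented region. The following is a minimal Python sketch of those two steps, not the authors' implementation. The function names, the (x1, y1, x2, y2) box format, and the simplification that lane boundaries are given as x-coordinates already sampled at the query row are all assumptions made for illustration.

```python
import bisect

import numpy as np


def bottom_midpoint(box):
    """Bottom-center pixel of a box given as (x1, y1, x2, y2) (assumed format)."""
    x1, _, x2, y2 = box
    return ((x1 + x2) / 2.0, float(y2))


def assign_lane(x, boundary_xs):
    """Index of the lane section containing x, or None if x lies outside.

    boundary_xs: x-coordinates of the lane boundaries at the row of the
    query point (hypothetical simplification; a real lane detector returns
    curves that would first be sampled at that row).
    """
    bounds = sorted(boundary_xs)
    idx = bisect.bisect_right(bounds, x) - 1
    if idx < 0 or idx >= len(bounds) - 1:
        return None  # left of the first or right of the last boundary
    return idx


def mean_depth_in_mask(depth_map, mask):
    """Average depth over the pixels selected by a boolean segmentation mask."""
    depth_map = np.asarray(depth_map, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if depth_map.shape != mask.shape:
        raise ValueError("depth_map and mask must have the same shape")
    if not mask.any():
        return None  # empty mask: no distance estimate
    return float(depth_map[mask].mean())


if __name__ == "__main__":
    # Toy values only; units depend on the depth model (assumed metric here).
    point_x, _ = bottom_midpoint((10, 20, 30, 60))
    print("lane index:", assign_lane(point_x, [0, 15, 40, 80]))
    depth = np.array([[1.0, 2.0], [3.0, 4.0]])
    mask = np.array([[True, False], [True, False]])
    print("mean depth:", mean_depth_in_mask(depth, mask))
```

For the ego vehicle, the Figure 3 caption uses the bottom middle point of the image instead of a bounding box, so in this sketch one would call `assign_lane(image_width / 2.0, boundary_xs)` with boundaries sampled at the last image row.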