Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning

Jingchun Yang, Jinchang Zhang

Abstract

The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming "what happened in the video" into "who is responsible under which legal provisions" still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.

Paper Structure (20 sections, 1 equation, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overview of the judge multi-agent framework. On the left is the multimodal fact construction module: a video understanding module first segments the video into multiple events and generates event-level video descriptions, which are then combined with the accident text annotations provided in the dataset and integrated by the Fact Aggregation Agent into a unified case fact statement. The upper right shows the legal resource retrieval module: based on the fact statement, the Judge Assistant retrieves relevant statutory provisions and typical cases from the traffic law knowledge base and external resources. The lower right shows the judge multi-agent module: the Issue Judge analyzes the case facts and responsibility modes, the Law-Precedent Judge reviews and supplements the applicable statutes and precedents, and the Deliberation Judge consolidates these opinions to produce the final liability determination and judgment.
  • Figure 2: Overall architecture of the video understanding and description framework. Given an input video, a Vehicle Ego-Motion Extractor first produces frame-wise ego-motion features, which are fed into an Encoder to obtain contextual features. On the one hand, an MLP head regresses the vehicle state (speed and steering) from these contextual features; on the other hand, a DETR-style Decoder outputs event features, which are further processed by an LSTM to generate dense captions together with their corresponding event locations. The Vehicle Ego-Motion Extractor itself is pre-trained in a self-supervised manner by using DepthNet to estimate depth, PoseNet to estimate camera intrinsics, and photometric reconstruction between adjacent frames $I_t$ and $I_{t+1}$.
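
The control flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Opinion`, `Judgment`, `aggregate_facts`, `run_pipeline`, and the `*_stub` functions) is hypothetical, the agents are plain callables standing in for LLM-backed components, and the statute and mode labels are placeholders rather than real provisions. Passing the Issue Judge's opinion to the Law-Precedent Judge is also an assumption; the caption only states that the Deliberation Judge consolidates both opinions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Opinion:
    """One judge agent's view: a responsibility mode plus supporting statutes."""
    responsibility_mode: str
    statutes: List[str] = field(default_factory=list)


@dataclass
class Judgment:
    """Final output: responsibility mode, statute set, and a report."""
    responsibility_mode: str
    statutes: List[str]
    report: str


def aggregate_facts(event_descriptions: List[str], text_annotation: str) -> str:
    """Stand-in for the Fact Aggregation Agent: merge event-level video
    descriptions with the dataset's text annotation into one fact statement."""
    events = " ".join(
        f"[event {i + 1}] {d}" for i, d in enumerate(event_descriptions)
    )
    return f"{events} [annotation] {text_annotation}"


def run_pipeline(
    event_descriptions: List[str],
    text_annotation: str,
    retrieve: Callable[[str], List[str]],
    issue_judge: Callable[[str, List[str]], Opinion],
    law_precedent_judge: Callable[[str, List[str], Opinion], Opinion],
    deliberation_judge: Callable[[str, Opinion, Opinion], Judgment],
) -> Judgment:
    facts = aggregate_facts(event_descriptions, text_annotation)
    resources = retrieve(facts)                      # Judge Assistant retrieval
    issue_opinion = issue_judge(facts, resources)    # facts and responsibility mode
    law_opinion = law_precedent_judge(facts, resources, issue_opinion)
    return deliberation_judge(facts, issue_opinion, law_opinion)


# Toy stubs (placeholders only; a real system would call models and a knowledge base).
def retrieve_stub(facts: str) -> List[str]:
    return ["statute-A", "statute-B"]


def issue_stub(facts: str, resources: List[str]) -> Opinion:
    return Opinion("mode-1", [resources[0]])


def law_stub(facts: str, resources: List[str], prev: Opinion) -> Opinion:
    # Review the earlier opinion and supplement it with the retrieved statutes.
    return Opinion(prev.responsibility_mode, sorted(set(prev.statutes) | set(resources)))


def deliberate_stub(facts: str, a: Opinion, b: Opinion) -> Judgment:
    statutes = sorted(set(a.statutes) | set(b.statutes))
    report = f"Facts: {facts}\nMode: {b.responsibility_mode}\nStatutes: {', '.join(statutes)}"
    return Judgment(b.responsibility_mode, statutes, report)
```

Calling `run_pipeline(["car brakes", "rear collision"], "rainy night", retrieve_stub, issue_stub, law_stub, deliberate_stub)` returns a `Judgment` whose report contains the aggregated fact statement. Keeping each agent as an injected callable makes it possible to swap a stub for a model-backed agent without touching the orchestration.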