Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Yao Gu; Xiaohao Xu; Yingna Wu

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Yao Gu, Xiaohao Xu, Yingna Wu

Abstract

Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming prior SOTA (66.9%)--and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Abstract

Paper Structure (12 sections, 2 equations, 4 figures, 3 tables)

This paper contains 12 sections, 2 equations, 4 figures, 3 tables.

Introduction
Related Work
Methodology
Problem Formulation
Structuring Physics as Textual Priors
Learning via Multi-Turn Dialogue
Inference and Implementation
Experiments
Experimental Setup
Results
Ablation Study
Conclusion

Figures (4)

Figure 1: Comparison of (a) conventional instruction fine-tuning and (b) our multi-turn physics-informed approach. Our method uses multi-turn dialogue to incorporate structured physical information as prior knowledge, enhancing the LLM's autoregressive generation with improved logic and structure. This universal physical prior reduces dataset construction burden.
Figure 2: The construction process of our structured physics information. This explicit knowledge base, $\mathcal{P}_c=(S_{com}, S_{dyn}, S_{mot})$, guides the model through a logical reasoning chain for physics-grounded anomaly detection.
Figure 3: Our model architecture for physics-grounded anomaly detection and explanation. Based on Video-LLaVA, only the LLM components (QKV matrices and FFN layer) are fine-tuned, while the vision tower and multi-modal projector remain frozen. The multi-turn dialogue injects physical priors, enabling robust anomaly detection and causal explanation in two stages.
Figure 4: Qualitative comparison of anomaly explanation methods on a 'leaking ball' scene. Our multi-turn physics-informed VLM provides a more accurate and causally relevant explanation than prior SOTA methods.

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Abstract

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Authors

Abstract

Table of Contents

Figures (4)