Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine
Minghao Han, Dingkang Yang, Yue Jiang, Yizhou Liu, Lihua Zhang
TL;DR
Physical understanding in omni-modal models is brittle because key physical attributes are visually ambiguous and sparsely represented in training data. The paper introduces OmniFysics, a compact omni-modal model trained with data from two sources: FysicsAny, which supplies supervision on static physical attributes, and FysicsOmniCap, which supplies dynamic audiovisual data. Training combines staged multimodal alignment with latent-space flow matching, and an intent router activates generation only when it is needed. The paper also proposes FysicsEval, a holistic benchmark, and SA-IAR, an adaptive reasoning module, to evaluate and constrain physical grounding and cross-modal consistency. OmniFysics is competitive on standard multimodal benchmarks and improves on physics-oriented evaluations, which suggests that explicit physical knowledge can be injected into omni-modal architectures to make them more reliable and physics-faithful.
Abstract
Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction–image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio–visual consistency filtering to generate high-fidelity video–instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
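As a rough illustration of what a latent-space flow matching objective can look like, the sketch below computes a regression loss against the velocity of a straight path from noise to a data latent. The abstract does not specify the probability path, the network, or the conditioning, so the linear path and the names `flow_matching_loss` and `velocity_fn` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, rng):
    """Flow matching loss on a linear noise-to-data path (assumed, not from the paper).

    velocity_fn: hypothetical callable (xt, t) -> predicted velocity, same shape as xt.
    x1: array of shape (batch, dim) holding target latents.
    rng: a numpy Generator used to draw noise and time steps.
    """
    x0 = rng.standard_normal(x1.shape)       # Gaussian noise, one sample per latent
    t = rng.uniform(size=(x1.shape[0], 1))   # one time in [0, 1) per example
    xt = (1.0 - t) * x0 + t * x1             # point on the straight path at time t
    target = x1 - x0                         # velocity of that path (constant in t)
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))
```

In a real system `velocity_fn` would be a neural network that also receives the text conditioning, and the latents would come from an image autoencoder; both are omitted here to keep the objective itself visible.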
