Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

Minghao Han, Dingkang Yang, Yue Jiang, Yizhou Liu, Lihua Zhang

TL;DR

The paper addresses the brittleness of physical understanding in omni-modal models, which it attributes to visually ambiguous physical attributes and sparse physical data. It introduces OmniFysics, a compact omni-modal model supported by a physical data engine: FysicsAny supplies supervision for static physical attributes, and FysicsOmniCap supplies dynamic audio-visual data. The model is trained with staged multimodal alignment and instruction tuning, uses latent-space flow matching for text-to-image generation, and relies on an intent router to activate generation only when needed. A new holistic benchmark, FysicsEval, and the SA-IAR adaptive reasoning module are proposed to evaluate and constrain physical grounding and cross-modal consistency. OmniFysics achieves competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations, suggesting that explicit physical knowledge can be injected into omni-modal architectures to make them more physics-faithful.

Abstract

Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction–image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio-visual consistency filtering to generate high-fidelity video–instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
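
The abstract names latent-space flow matching but does not state the objective. As a rough orientation only, the widely used linear-path form of conditional flow matching is sketched below. It is an assumption that OmniFysics uses this variant, and every symbol here ($x_0$, $x_1$, $c$, $v_\theta$) is illustrative notation, not the paper's own.

```latex
% Assumed notation (not from the paper):
%   x_1      : VAE latent of the target image
%   x_0      : Gaussian noise of the same shape
%   c        : text conditioning
%   v_\theta : learned velocity field
\begin{align}
  x_t &= (1 - t)\,x_0 + t\,x_1,
      \qquad x_0 \sim \mathcal{N}(0, I),\; t \sim \mathcal{U}[0, 1], \\
  \mathcal{L}_{\mathrm{FM}}(\theta)
      &= \mathbb{E}_{t,\,x_0,\,x_1}
         \bigl\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\rVert_2^2 .
\end{align}
```

Under this formulation, generation would integrate $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$ and decode the resulting latent with the VAE decoder.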

Paper Structure (36 sections, 5 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: FysicsAny Pipeline. It combines retrieval, analogy reasoning, and physics-law verification for physical-attribute supervision.
  • Figure 2: Overview of OmniFysics and training data distribution. (a) Model architecture. The model employs Temporal Multimodal Rotary Position Embedding to process interleaved sequences of images, audio, and text. For understanding, the Vision and Audio Encoders extract features to feed the LLM backbone. For generation, the Codec and VAE Encoder assist the SpokenVoxer and Flow Head in synthesizing audio and imagery. (b) Data distribution for Omni-modal Joint Training. The pie chart shows the modal proportions for this training stage: image (48%), sound (16%), speech (14%), omni (11%), video (8%), and text-only (3%).
  • Figure 3: Training pipeline of OmniFysics. We implement a four-stage training strategy for the proposed OmniFysics to progressively enhance its omni-modal perception and physical understanding capabilities, including speech and text-to-image generation.
  • Figure 4: Performance of OmniFysics on Image Understanding Benchmarks compared to leading MLLMs under 4B parameters.
  • Figure 5: Physics-aware Generation. Mapping Physical Parameters to Faithful Visual Materials.
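
The modal proportions in Figure 2(b) can be read as sampling weights for a data mixture. The sketch below is a minimal illustration under that assumption; the page does not say how the authors actually draw batches, and `MODALITY_WEIGHTS` and `sample_modalities` are hypothetical names introduced here.

```python
import random
from collections import Counter

# Modal proportions reported in Figure 2(b) for omni-modal joint training.
MODALITY_WEIGHTS = {
    "image": 0.48,
    "sound": 0.16,
    "speech": 0.14,
    "omni": 0.11,
    "video": 0.08,
    "text-only": 0.03,
}


def sample_modalities(n, seed=0):
    """Draw n modality labels with probability proportional to the weights.

    Hypothetical helper: treating the reported proportions as per-example
    sampling probabilities is an assumption, not the paper's stated method.
    """
    rng = random.Random(seed)
    names = list(MODALITY_WEIGHTS)
    weights = [MODALITY_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    total = 10_000
    counts = Counter(sample_modalities(total))
    for name in MODALITY_WEIGHTS:
        print(f"{name:>9}: {counts[name] / total:.3f}")
```

Running the script prints the empirical share of each modality, which should land close to the reported percentages for a large enough sample.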