Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

Minghao Han; Dingkang Yang; Yue Jiang; Yizhou Liu; Lihua Zhang

TL;DR

The paper addresses the brittleness of physical understanding in omni-modal models, which stems from visually ambiguous physical attributes and sparse physical data. It introduces OmniFysics, a compact omni-modal model supported by two data components: FysicsAny for static physical attributes and FysicsOmniCap for dynamic audiovisual data. The model is trained via staged multimodal alignment and latent-space flow matching, and uses an intent router to invoke generation only when needed. A new holistic benchmark, FysicsEval, and the SA-IAR adaptive reasoning module are proposed to evaluate and constrain physical grounding and cross-modal consistency. OmniFysics achieves competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations, suggesting that explicit physical knowledge can be injected into omni-modal architectures for more reliable, physics-faithful AI.

Abstract

Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction–image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law–constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio–visual consistency filtering to generate high-fidelity video–instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
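
The abstract names latent-space flow matching as the text-to-image objective but does not give its form. As a rough illustration only, the sketch below computes a one-sample flow matching loss under an assumed straight-line (rectified-flow style) path between Gaussian noise and a target latent. The path choice, the function names, and the toy `zero_model` are assumptions made for this example and are not taken from the paper.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to latent x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, which is constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_model, x1, rng):
    """One-sample estimate of the flow matching loss for a target latent x1.

    velocity_model is any callable (x_t, t) -> predicted velocity (hypothetical
    interface; a real system would use a neural network conditioned on text).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                          # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = velocity_model(x_t, t)
    v_true = target_velocity(x0, x1)
    # Mean squared error between predicted and target velocity.
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def zero_model(x_t, t):
    """Toy stand-in for a network: always predicts zero velocity."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
latent = [0.5, -1.0, 2.0, 0.0]   # made-up 4-dimensional latent
loss = flow_matching_loss(zero_model, latent, rng)
```

In training, this loss would be averaged over batches and minimized with respect to the parameters of the velocity model; at sampling time, the learned velocity field would be integrated from noise toward a latent that an image decoder then maps to pixels.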

Paper Structure (36 sections, 5 equations, 5 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Omni-Modal Foundation Models
Physics AI and Benchmarks
Data Curation Engine
Multifaceted Physical Attribute Mapping Pipeline
Dynamic Omni Alignment Engine
FysicsEval: Physical Intelligence Benchmark
Methodology
Model Architecture
Training Strategy
Modality-Specific Training
Omni-modal Joint Training
Audio Generation Training
Image Generation Training
...and 21 more sections

Figures (5)

Figure 1: FysicsAny Pipeline. It combines retrieval, analogy reasoning, and physics-law verification for physical-attribute supervision.
Figure 2: Overview of OmniFysics and training data distribution. (a) Model architecture. The model employs Temporal Multimodal Rotary Position Embedding to process interleaved sequences of images, audio, and text. For understanding, the Vision and Audio Encoder extract features to feed the LLM backbone. For generation, the Codec and VAE Encoder are utilized to assist the SpokenVoxer and Flow Head in synthesizing audio and imagery. (b) Data distribution for Omni-modal Joint Training. The pie chart illustrates the modal proportions specific to this training stage: image (48%), sound (16%), speech (14%), omni (11%), video (8%), and text-only (3%).
Figure 3: Training pipeline of OmniFysics. We implement a four-stage training strategy for the proposed OmniFysics to progressively enhance its omni-modal perception and physical understanding capabilities, including speech and text-to-image generation.
Figure 4: Performance of OmniFysics on Image Understanding Benchmarks compared to leading MLLMs under 4B parameters.
Figure 5: Physics-aware Generation. Mapping Physical Parameters to Faithful Visual Materials.
