From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

Liangbing Zhao, Le Zhuo, Sayak Paul, Hongsheng Li, Mohamed Elhoseiny

TL;DR

This paper proposes PhysicEdit, an end-to-end framework with a textual-visual dual-thinking mechanism: a frozen Qwen2.5-VL performs physically grounded reasoning, while learnable transition queries provide timestep-adaptive visual guidance to a diffusion backbone.

Abstract

Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.

Paper Structure (55 sections, 5 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Bridging semantic alignment and physical plausibility. (Top) Despite high semantic fidelity, existing editing models frequently violate physical principles. (Bottom) Traditional image editing treats editing as a black box, learning a discrete mapping with underspecified constraints. Our approach reformulates editing as a Physical State Transition, leveraging continuous dynamics to constrain the state transition space from unreal hallucinations to physically valid trajectories.
  • Figure 2: Overview of the PhysicTran38K construction pipeline. Starting from hierarchical physics categories, we synthesize videos using Wan2.2-T2V-A14B and filter them with ViPE, using an adaptive strategy to preserve high-dynamic transitions. Candidate videos then undergo principle-driven verification by GPT-5-mini under a rigorous retention rule. Finally, Qwen2.5-VL-7B performs constraint-aware annotation, generating instructions and structured reasoning while incorporating the verification results to prevent hallucinations.
  • Figure 3: Overview of the PhysicEdit framework. (a) Training: We distill physical transition priors from video data into learnable transition queries. These queries are supervised by complementary visual features extracted from intermediate keyframes. (b) Inference: PhysicEdit follows a sequential workflow. The frozen MLLM first generates physically grounded reasoning, which is then concatenated with the learned transition queries to serve as the condition for the diffusion backbone.
  • Figure 4: Qualitative comparison on PICABench. We visualize editing results across diverse physical domains, including Optics, Mechanics, Global State, and Local State. Compared to the backbone Qwen-Image-Edit and proprietary models, PhysicEdit consistently generates more physically plausible and visually natural results, avoiding the physical inconsistencies observed in baseline methods.
  • Figure 5: More detailed illustrations on Mechanical, Biological and Thermal data.
  • ...and 5 more figures
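
The inference workflow in the Figure 3 caption (reasoning tokens concatenated with learned transition queries to condition the diffusion backbone) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the token and query shapes, the random initialization, and in particular the cosine gate used to make the queries "timestep-adaptive" are all assumptions made here for illustration, since this page does not specify the actual mechanism.

```python
import math
import random


def make_transition_queries(num_queries, dim, seed=0):
    """Hypothetical stand-in for learned transition queries (random values here)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_queries)]


def timestep_gate(t, num_steps):
    """Assumed gate in [0, 1]: 1 at t = 0, 0 at t = num_steps.

    The real timestep-adaptive scheme is not described on this page.
    """
    return 0.5 * (1.0 + math.cos(math.pi * t / num_steps))


def build_condition(reasoning_tokens, transition_queries, t, num_steps):
    """Concatenate reasoning embeddings with gated transition queries.

    reasoning_tokens: list of T vectors (stand-in for frozen MLLM output).
    transition_queries: list of Q vectors of the same dimension.
    Returns a list of T + Q vectors to be used as the conditioning sequence.
    """
    gate = timestep_gate(t, num_steps)
    gated_queries = [[gate * v for v in q] for q in transition_queries]
    return reasoning_tokens + gated_queries


# Toy usage: 3 reasoning tokens and 2 transition queries, each of dimension 4.
reasoning = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
queries = make_transition_queries(num_queries=2, dim=4)
condition = build_condition(reasoning, queries, t=0, num_steps=50)
print(len(condition), len(condition[0]))
```

The only part grounded in the caption is the concatenation along the token axis; a real system would pass the resulting sequence to the diffusion backbone's conditioning input rather than print its shape.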