ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling

TL;DR

This work introduces ChronoEdit, a foundation model for physically consistent image editing that reframes an edit as a short video generation task and builds on pretrained image-to-video diffusion models. At inference, a temporal reasoning stage denoises intermediate frames as reasoning tokens to imagine a plausible, physically viable edit trajectory, then discards them so that only the final frame is refined further. The authors curate a large synthetic video dataset and propose PBench-Edit to evaluate physical consistency in world-simulation contexts. Empirical results show state-of-the-art performance among open-source models and results competitive with leading proprietary systems. Fast variants such as ChronoEdit-Turbo, together with the choice of reasoning horizon, trade quality against efficiency. Overall, ChronoEdit provides a scalable approach to temporally coherent, physically grounded image edits applicable to autonomous driving, robotics, and other simulation tasks.

Abstract

Recent advances in large generative models have greatly enhanced both image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world-simulation-related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Project page for code and models: https://research.nvidia.com/labs/toronto-ai/chronoedit
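
The two-stage inference schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_denoise_step`, `two_stage_edit`, the scalar "latents", and all parameter names are hypothetical stand-ins chosen for illustration. In particular, the toy denoiser treats each frame independently, whereas in a real video diffusion model the target frame would presumably interact with the reasoning frames during joint denoising. The sketch only shows the control flow (jointly denoise, then drop the reasoning frames) and why dropping them reduces the number of frames processed.

```python
import random


def toy_denoise_step(frames, step, total_steps, edit_delta=1.0):
    """Hypothetical stand-in for one call to a video diffusion denoiser.

    `frames` is a list of (time, value) pairs with time in [0, 1]. The frame
    at time 0 is the clean reference and is returned unchanged. Every other
    frame moves toward reference + edit_delta * time, a toy "trajectory".
    """
    blend = 1.0 / (total_steps - step)  # reaches 1.0 on the final step
    ref_value = frames[0][1]
    out = [frames[0]]
    for time, value in frames[1:]:
        clean = ref_value + edit_delta * time
        out.append((time, (1 - blend) * value + blend * clean))
    return out


def two_stage_edit(ref_value, num_reasoning, total_steps, reasoning_steps, seed=0):
    """Denoise the target with reasoning frames, then drop them and refine.

    Returns the final target value and the total number of frames processed.
    """
    rng = random.Random(seed)
    n = num_reasoning + 1  # noisy frames: reasoning frames plus the target
    frames = [(0.0, ref_value)]
    frames += [((i + 1) / n, rng.gauss(0.0, 1.0)) for i in range(n)]
    frame_evals = 0
    for step in range(total_steps):
        if step == reasoning_steps:
            # Stage 2: discard reasoning frames, keep reference and target.
            frames = [frames[0], frames[-1]]
        frames = toy_denoise_step(frames, step, total_steps)
        frame_evals += len(frames)
    return frames[-1][1], frame_evals


if __name__ == "__main__":
    target, cost = two_stage_edit(2.0, num_reasoning=4, total_steps=10, reasoning_steps=3)
    _, full_cost = two_stage_edit(2.0, num_reasoning=4, total_steps=10, reasoning_steps=10)
    print(target, cost, full_cost)
```

With 4 reasoning frames and 10 steps, dropping the reasoning frames after 3 steps processes 3 × 6 + 7 × 2 = 32 frames in this toy setting, compared with 60 when the full sequence is kept for every step.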

Paper Structure

This paper contains 16 sections, 2 equations, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: Physically consistent image editing results with ChronoEdit-14B. ChronoEdit produces edits that are both visually convincing and physically consistent with the underlying scene context.
  • Figure 2: Failure cases of state-of-the-art image editing models. Current state-of-the-art models often struggle to maintain physical consistency on world simulation-related editing tasks. They may hallucinate unintended objects or distort scene geometry. In contrast, our method produces edits that are faithful to the instruction and remain coherent with the scene. Prompts (from top to bottom): (1) "The left silver SUV makes a U-turn", (2) "Pick up the spoon with the robot arm", and (3) "Close the wooden piece by hand".
  • Figure 3: Overview of the ChronoEdit pipeline. From right to left, the denoising process begins in the temporal reasoning stage, where the model imagines and denoises a short trajectory of intermediate frames. These intermediate frames act as reasoning tokens, guiding how the edit should unfold in a physically consistent manner. For efficiency, the reasoning tokens are discarded in the subsequent editing frame generation stage, where the target frame is further refined into the final edited image.
  • Figure 4: Comparison with baseline methods. The first two rows show examples from the ImgEdit Basic-Edit Suite benchmark (Ye et al., 2025), and the last row is from PBench-Edit, where ChronoEdit-Think is evaluated with 10 temporal reasoning steps. In both benchmarks, ChronoEdit achieves edits that more faithfully follow the given instructions while preserving scene structure and fine details.
  • Figure 5: Qualitative results on Physical AI world-simulation tasks. All results are generated by ChronoEdit-14B-Think. Each group shows a reference image (left) and the corresponding output (right). ChronoEdit produces edits that accurately follow the given instructions while preserving scene structure and fine details in Physical AI–related scenes.
  • ...and 8 more figures