FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

Hugo Caselles-Dupré, Mathis Koroglu, Guillaume Jeanneret, Arnaud Dapogny, Matthieu Cord

Abstract

Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
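The fusion step described above can be illustrated with a small NumPy sketch. The abstract does not spell out the objective, so the form used here is an assumption: at each pixel, minimize $\sum_i w_i (v - y_i)^2 + \lambda (v - v_{\text{prior}})^2$ over the tiles $i$ covering that pixel, whose minimizer is $v = (\sum_i w_i y_i + \lambda v_{\text{prior}}) / (\sum_i w_i + \lambda)$. The names `fuse_tiles`, `lam`, and `tile_weights` are hypothetical and do not come from the paper.

```python
import numpy as np

def fuse_tiles(tile_preds, tile_slices, prior, lam, tile_weights=None):
    """Fuse per-tile predictions with a global prior, pixel by pixel.

    Assumed objective (illustrative, not taken from the paper):
        sum_i w_i * (v - y_i)^2 + lam * (v - prior)^2
    Its closed-form minimizer is a weighted average of the tile
    predictions and the prior.
    """
    prior = np.asarray(prior, dtype=np.float64)
    # lam may be a scalar or a per-pixel map with the shape of the canvas.
    lam = np.broadcast_to(np.asarray(lam, dtype=np.float64), prior.shape)
    num = lam * prior   # numerator starts with the prior term
    den = lam.copy()    # denominator starts with the prior weight
    for k, (pred, sl) in enumerate(zip(tile_preds, tile_slices)):
        w = 1.0 if tile_weights is None else tile_weights[k]
        num[sl] += w * pred
        den[sl] += w
    # Every pixel needs lam > 0 or at least one covering tile,
    # otherwise the denominator is zero.
    return num / den

# Two overlapping 1-D tiles on a canvas of length 6.
prior = np.full(6, 2.0)
tiles = [np.ones(4), np.full(4, 3.0)]
slices = [(slice(0, 4),), (slice(2, 6),)]

print(fuse_tiles(tiles, slices, prior, lam=0.0).tolist())
# -> [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]  (plain tile averaging)
print(fuse_tiles(tiles, slices, prior, lam=1.0).tolist())
# -> [1.5, 1.5, 2.0, 2.0, 2.5, 2.5]  (pulled toward the prior)
```

With `lam=0` the sketch reduces to ordinary overlap averaging of the tiles; increasing `lam` moves the fused field toward the prior, which is one way to read the trade-off between creativity and consistency mentioned in the abstract.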

Paper Structure (51 sections, 12 equations, 13 figures, 5 tables, 1 algorithm)

Figures (13)

  • Figure 1: From a single ultra-high-definition image ($3500\times3500$), FrescoDiffusion generates an animation at the same resolution. We show three frames from the generated video. The red box marks a fixed spatial region tracked across time, illustrating temporally consistent motion and fine-detail preservation.
  • Figure 2: Overview of FrescoDiffusion. Starting from a 4K fresco image, we first build a global latent prior by resizing the image to the native input size of the image-to-video backbone and generating a low-resolution video. Next, we upsample the prior latents $x_{\text{prior}}$ to fit the 4K image size. We then apply tiled denoising to the large latent canvas, $x_t^{\text{4K}}$, obtaining per-tile flow predictions, $\{y_i\}$. From $\{y_i\}$ and $x_{\text{prior}}$, we compute the optimal output velocity field (Eq. \ref{eq:closedform}) according to our loss $\ell_{\text{FD}}$ (Eq. \ref{eq:f2v-energy}). Finally, the flow-matching scheduler uses this fused field to update the large latent canvas, $x_t^{\text{4K}}$.
  • Figure 3: (Top) MSE between the foreground / background regions and the prior. (Bottom) Schedule for both regions.
  • Figure 4: Overlay of the spatial activity map onto the input fresco.
  • Figure 5: A qualitative comparison on fresco-scale inputs. FrescoDiffusion generates globally coherent scenes while animating details at a local level. By contrast, DemoFusion, DynamicScaler, and MultiDiffusion produce either coherent scenes or high-quality details, but not both.
  • ...and 8 more figures
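
The tiling and update steps named in the Figure 2 caption, together with the region-level control suggested by Figures 3 and 4, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`tile_slices`, `regularization_map`, `euler_step`), the way an activity map is turned into a per-pixel prior weight, and the plain Euler update standing in for the flow-matching scheduler are all hypothetical.

```python
import numpy as np

def tile_starts(size, tile, stride):
    """Start offsets of windows of length `tile` that cover [0, size)."""
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # last window sits flush with the border
    return starts

def tile_slices(height, width, tile, stride):
    """Overlapping square windows covering a (height, width) latent canvas."""
    return [(slice(r, r + tile), slice(c, c + tile))
            for r in tile_starts(height, tile, stride)
            for c in tile_starts(width, tile, stride)]

def regularization_map(activity, lam_low_activity, lam_high_activity):
    """Per-pixel prior weight from an activity map with values in [0, 1].

    Assumption: the weight is interpolated linearly between the two
    user-chosen values; the paper's actual parameterization may differ.
    """
    activity = np.clip(np.asarray(activity, dtype=np.float64), 0.0, 1.0)
    return lam_low_activity + (lam_high_activity - lam_low_activity) * activity

def euler_step(x, velocity, dt):
    """One explicit Euler update of the latent canvas along a velocity field."""
    return x + dt * velocity

# Example: a 10x10 canvas split into 4x4 tiles with stride 3.
slices = tile_slices(10, 10, tile=4, stride=3)
coverage = np.zeros((10, 10), dtype=int)
for sl in slices:
    coverage[sl] += 1
print(len(slices), int(coverage.min()))  # -> 9 1
```

The per-pixel map returned by `regularization_map` could be passed wherever a scalar prior weight is expected, so that different regions of the canvas are tied to the prior with different strengths.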