
Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion

Edoardo A. Dominici, Thomas Deixelberger, Konstantinos Vardis, Markus Steinberger

Abstract

Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing and for aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, such as DINO, contain heavily entangled information about the style, lighting, and semantics of a scene. This makes them well suited to reconstruction tasks but limits their generative capabilities. In this paper, we show how these features can be used for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from the other features we wish to preserve, enabling robust control over appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated for by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
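The abstract describes the method only at a high level, but the core conditioning idea can be illustrated with a short sketch. The module below is an assumption on our part (the class name, dimensions, and residual-injection scheme are illustrative, not the paper's implementation): a ControlNet-style branch that consumes DINOv3 patch features and emits zero-initialized residuals to be added to the activations of a frozen image-to-video backbone.

```python
# Minimal sketch of a DINO-feature control branch (hypothetical; not the authors' code).
import torch
import torch.nn as nn


class DINOControlBranch(nn.Module):
    """ControlNet-style branch: maps DINOv3 patch features to per-level
    residuals that are added to a frozen video diffusion backbone."""

    def __init__(self, dino_dim=1024, hidden_dim=512, num_levels=4):
        super().__init__()
        self.in_proj = nn.Linear(dino_dim, hidden_dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
             for _ in range(num_levels)]
        )
        # Zero-initialized output projections: the branch starts as a no-op,
        # leaving the pretrained backbone unchanged at the start of training.
        self.out_projs = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_levels)]
        )
        for proj in self.out_projs:
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, dino_feats):
        # dino_feats: (batch, num_patch_tokens, dino_dim) from a DINOv3 encoder
        h = self.in_proj(dino_feats)
        residuals = []
        for block, proj in zip(self.blocks, self.out_projs):
            h = block(h)
            residuals.append(proj(h))  # one additive residual per backbone level
        return residuals


if __name__ == "__main__":
    branch = DINOControlBranch()
    feats = torch.randn(2, 256, 1024)    # dummy DINOv3 patch features
    print([r.shape for r in branch(feats)])  # 4 residuals of shape (2, 256, 512)
```

The zero-initialized projections follow the usual ControlNet recipe, so the control signal is introduced gradually without disturbing the pretrained model's behavior early in training.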

Paper Structure

This paper contains 28 sections, 4 equations, 24 figures, and 6 tables.

Figures (24)

  • Figure 1: We introduce a DINO-based ControlNet for Image-to-Video diffusion models that allows using DINOv3 features for structural and semantic guidance. Shown here: (1) DL3DV style transfer; (2) VKITTI synthetic-scene-to-real-weather transfer; (3) video-from-3DGS. We apply a transfer function to the first frames with FLUX.1 KREA [flux1kreadev2025].
  • Figure 2: DINO features are prone to overfitting when simply training a ControlNet. We study the conditioning effect by dropping the first-frame conditioning and relying only on the text embedding for guidance. With the prompt set to "blue", our method significantly reduces the ControlNet's bias towards the training-data domain.
  • Figure 3: (Left) During training, DINOv3 features from the original video condition the trainable Control-DINO branch, while the frozen backbone denoises appearance-augmented latents of the same scene. (Right) At inference, conditioning features (2D or 3D-rendered) guide generation through Control-DINO. A transferred first frame sets the target appearance.
  • Figure 4: Results for DINO stylization and dramatic lighting changes (the GT frame is shown in the bottom left of each image). Our method strongly retains the original geometry and semantics while allowing appearance and color to change.
  • Figure 5: Visualisation of 3D structure used on ScanNet++ first frames. Top row: voxel RGB renderings. Bottom row: mesh video renderings.
  • ...and 19 more figures
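The appearance-decoupling strategy summarized in the Figure 3 caption can be sketched roughly as below. This is a hedged illustration rather than the authors' code: `dino_v3`, `control_branch`, `backbone`, `encode_latents`, `augment_appearance`, and `noise_scheduler` are hypothetical placeholders passed in by the caller, and the standard epsilon-prediction denoising loss is assumed for concreteness.

```python
# Hypothetical training step following the Figure 3 description (not the authors' code).
import torch
import torch.nn.functional as F


def training_step(video, dino_v3, control_branch, backbone, encode_latents,
                  augment_appearance, noise_scheduler, optimizer):
    # Conditioning features are extracted from the ORIGINAL video ...
    cond_feats = dino_v3(video)

    # ... while the frozen backbone denoises latents of an appearance-augmented
    # copy of the same scene, so the control branch learns structure and
    # semantics instead of copying colors/lighting from the DINO features.
    latents = encode_latents(augment_appearance(video))
    noise = torch.randn_like(latents)
    timesteps = noise_scheduler.sample_timesteps(latents.shape[0])
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    residuals = control_branch(cond_feats)                        # trainable branch
    pred = backbone(noisy_latents, timesteps, control=residuals)  # frozen weights

    loss = F.mse_loss(pred, noise)   # standard denoising objective (assumed)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```

At inference, the same branch would instead be fed features computed from a 2D video or a 3D rendering, with a transferred first frame setting the target appearance, as described in the Figure 3 caption.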