DreamColour: Controllable Video Colour Editing without Training

Chaitat Utintu, Pinaki Nath Chowdhury, Aneeshan Sain, Subhadeep Koley, Ayan Kumar Bhunia, Yi-Zhe Song

TL;DR

DreamColour tackles training‑free, temporally coherent video colour editing by decoupling spatial colour edits from temporal propagation. It combines a grid‑based, instance‑aware intra‑frame editing stage with bidirectional diffusion priors and spatio‑temporal feature injection to propagate edits across frames without retraining. Key contributions include SAM2‑guided masking of UniColor hints for precise region control, DDIM inversion with BLIP‑2 semantics, and a forward‑backward propagation framework that yields smooth, artefact‑free colour transitions in complex scenes. The approach delivers professional‑quality results on diverse videos using only pre‑trained components, enabling accessible colour editing without specialised hardware. The reported ablations and comparisons show improved boundary fidelity, temporal consistency, and qualitative appeal relative to zero‑shot baselines.

Abstract

Video colour editing is a crucial task for content creation, yet existing solutions either require painstaking frame-by-frame manipulation or produce unrealistic results with temporal artefacts. We present a practical, training-free framework that makes precise video colour editing accessible through an intuitive interface while maintaining professional-quality output. Our key insight is that by decoupling spatial and temporal aspects of colour editing, we can better align with users' natural workflow -- allowing them to focus on precise colour selection in key frames before automatically propagating changes across time. We achieve this through a novel technical framework that combines: (i) a simple point-and-click interface merging grid-based colour selection with automatic instance segmentation for precise spatial control, (ii) bidirectional colour propagation that leverages inherent video motion patterns, and (iii) motion-aware blending that ensures smooth transitions even with complex object movements. Through extensive evaluation on diverse scenarios, we demonstrate that our approach matches or exceeds state-of-the-art methods while eliminating the need for training or specialized hardware, making professional-quality video colour editing accessible to everyone.

Paper Structure

This paper contains 39 sections, 4 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Our training-free framework enables intuitive video colour editing in two stages. First, users simply select colours from a $16\times16$ grid to edit any frame, with automatic instance segmentation [ravi2024sam2] preventing colour bleeding. Then, our bidirectional propagation mechanism, combining temporal attention [ho2022videodiffusionmodels] and motion-aware blending [szeliski2021CV], ensures smooth colour transitions across frames. This approach enables flexible editing scenarios: from single to multiple regions, and from any frame in the sequence, while maintaining temporal consistency through careful integration of diffusion inversion [song2022denoisingdiff] and instance-aware colour control [cong2024colourisation].
  • Figure 2: Our single-region colour editing begins with greyscale conversion and superpixel generation, which provide the structural foundation and the initial colour hints, respectively. User-defined hints are refined with SAM2 instance segmentation, creating an accurate object mask that guides UniColor to produce an edited frame in which colour is applied to the targeted region and the unselected regions are preserved.
  • Figure 3: Multi-region colour editing pipeline using SAM2 and UniColor. SAM2's point-based instance segmentation applies positive and negative prompts to generate masks for each selected region, preventing unintended colour spillover. The combined and refined colour hints are then processed by UniColor to produce an edited frame with well-defined local colour consistency.
  • Figure 4: The primary pathway (top) performs DDIM inversion on the reference video $\mathcal{V}$, generating latent noise $z_{t}^{\mathcal{V}}$ to capture motion and structural cues. The secondary pathway (bottom) starts with the edited frame $\mathcal{I}_{edited}$ and random noise $z_{t}^{\ast}$, injecting spatio-temporal features from the primary pathway for coherence. BLIP-2 provides textual descriptions, enhancing semantic consistency and colour fidelity in the generated video.
  • Figure 5: To edit the $m^{th}$ intermediate frame, the video is divided into forward ($\mathcal{I}_{m} \to \mathcal{I}_{n}$) and backward ($\mathcal{I}_{m} \to \mathcal{I}_{1}$) subsequences. First-frame colour editing is then applied separately in each direction, with colour changes propagated through denoising steps. The edited segments are then combined to create a fully edited video.
  • ...and 10 more figures
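
The forward/backward split described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: `split_bidirectional`, `edit_bidirectional`, and the `propagate` callback are hypothetical names, `propagate` stands in for the diffusion-based propagation stage, and the merge shown here simply concatenates the two segments rather than applying the motion-aware blending mentioned in the abstract. Indices are 0-based, whereas the caption numbers frames from 1.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def split_bidirectional(
    frames: Sequence[Frame], m: int
) -> Tuple[List[Frame], List[Frame]]:
    """Split a video at the edited frame (0-based index m).

    Returns (forward, backward), where forward runs m -> last frame and
    backward runs m -> first frame. Both start with the edited frame, so
    each can be treated as a "first-frame editing" problem.
    """
    if not 0 <= m < len(frames):
        raise IndexError("edited-frame index out of range")
    forward = list(frames[m:])       # I_m, I_{m+1}, ..., I_n
    backward = list(frames[m::-1])   # I_m, I_{m-1}, ..., I_1
    return forward, backward


def edit_bidirectional(
    frames: Sequence[Frame],
    m: int,
    propagate: Callable[[List[Frame]], List[Frame]],
) -> List[Frame]:
    """Propagate an edit from frame m in both directions and merge.

    `propagate` is a hypothetical placeholder: it receives a subsequence
    whose first element is the edited frame and returns the edited
    subsequence in the same order.
    """
    forward, backward = split_bidirectional(frames, m)
    edited_fwd = propagate(forward)
    edited_bwd = propagate(backward)
    # Restore temporal order for the backward segment and drop its copy of
    # frame m, which the forward segment already contains (an assumption).
    head = list(reversed(edited_bwd))[:-1]
    return head + edited_fwd
```

For example, with five frames and the edit applied at index 2, the forward segment covers indices 2 to 4, the backward segment covers indices 2 down to 0, and the merged output has the original length and order.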