TC-Light: Temporally Coherent Generative Rendering for Realistic World Transfer

Yang Liu, Chuanchen Luo, Zimo Tang, Yingyan Li, Yuran Yang, Yuanyong Ning, Lue Fan, Junran Peng, Zhaoxiang Zhang

TL;DR

TC-Light presents a temporally coherent relighting framework for long, dynamic videos by inflating a strong image relighting model to video space and applying a two-stage post-optimization. The core is a canonical Unique Video Tensor (UVT) representation that compresses spatiotemporal information and enables efficient, coherent optimization, augmented by a decayed multi-axis denoising approach during video-space diffusion. Stage I exposure alignment and Stage II UVT refinement jointly reduce illumination and texture flicker, delivering physically plausible results at low computational cost. The method achieves state-of-the-art temporal coherence on a challenging long-video benchmark and shows strong performance across synthetic and real-world scenarios, with potential impact for sim2real, real2real, and embodied AI data generation.

Abstract

Illumination and texture editing are critical dimensions of world-to-world transfer, which is valuable for applications such as scaling up sim2real and real2real visual data for embodied AI. Existing techniques, such as video relighting models and conditioned world generation models, realize the transfer by generatively re-rendering the input video. Nevertheless, these models are predominantly limited to the domain of their training data (e.g., portraits) or are bottlenecked by temporal consistency and computational efficiency, especially when the input video involves complex dynamics and long durations. In this paper, we propose TC-Light, a novel generative renderer that overcomes these problems. Starting from a video preliminarily relit by an inflated video relighting model, it optimizes an appearance embedding in the first stage to align global illumination. In the second stage, it optimizes the proposed canonical video representation, i.e., the Unique Video Tensor (UVT), to align fine-grained texture and lighting. To comprehensively evaluate performance, we also establish a benchmark of long and highly dynamic videos. Extensive experiments show that our method produces physically plausible re-rendering results with superior temporal coherence and low computational cost. The code and video demos are available at https://dekuliutesla.github.io/tclight/.
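
To make the first-stage idea of aligning global illumination more concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that the appearance embedding can be approximated by one gain per frame and per color channel, and it swaps the paper's optimization for a closed-form step that pulls each frame's mean color toward a temporally smoothed mean. The function name `align_exposure` and the `window` parameter are hypothetical.

```python
import numpy as np


def align_exposure(video, window=5):
    """Reduce global brightness flicker in a video.

    video:  float array of shape (T, H, W, C) with values in [0, 1].
    window: odd length of the temporal moving average (hypothetical knob).

    Assumption: a per-frame, per-channel gain stands in for the learned
    appearance embedding described in the abstract.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")

    # Per-frame mean color, shape (T, C).
    means = video.mean(axis=(1, 2))

    # Smooth the means over time; edge padding keeps the output length T.
    pad = window // 2
    padded = np.pad(means, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    target = np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid")
         for c in range(means.shape[1])],
        axis=1,
    )

    # Gain that maps each frame's mean color onto the smoothed target.
    gains = target / np.maximum(means, 1e-8)

    # Broadcast the (T, C) gains over the spatial axes and clip to range.
    return np.clip(video * gains[:, None, None, :], 0.0, 1.0)
```

Calling `align_exposure(video)` on a clip whose frames alternate between brighter and darker exposures returns a clip of the same shape whose per-frame mean brightness varies less. It only addresses global exposure; fine-grained texture and lighting consistency is what the second-stage UVT optimization targets.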

Paper Structure

This paper contains 22 sections, 9 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Relighting results on long videos of various dynamic scenes, averaging 256 frames per clip. Although the videos involve frequent changes of foreground objects (row (a)) and highly dynamic camera motion (row (b)), TC-Light produces consistent and physically plausible relighting results. Row (c) also shows its potential to mitigate the sim2real gap for synthetic renderings.
  • Figure 2: TC-Light overview. Given the source video and text prompt $p$, the model tokenizes the input latents in the $xy$ plane and the $yt$ plane separately. The predicted noises are adaptively combined for denoising (cf. \ref{subsec: video model}). The output then undergoes a two-stage optimization that enhances the temporal consistency of illumination and texture, detailed in \ref{subsubsec: exposure} and \ref{subsubsec: uvt}, respectively.
  • Figure 3: Qualitative comparison of results. The proposed TC-Light avoids the unnatural relighting of Slicedit \cite{cohen2024slicedit} and COSMOS-Transfer1 \cite{alhaija2025cosmos} in (a), the blurring of COSMOS-Transfer1 \cite{alhaija2025cosmos} in (b), and the inconsistent illumination of per-frame IC-Light \cite{zhang2025scaling} and VidToMe \cite{li2024vidtome}, as highlighted by the red squares.
  • Figure 4: Ablation on the main module components. The experiment is conducted on one sequence of the InteriorNet \cite{InteriorNet18} subset, where the text prompt is "This video showcases a modern interior space, which is dimly lit". The baseline here denotes VidToMe \cite{li2024vidtome} in \ref{tab: comparison}.
  • Figure 5: Qualitative results on additional long, highly dynamic videos.
  • ...and 3 more figures