Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos
Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, Changxin Gao
TL;DR
Tunnel Try-on is a diffusion-based framework for video virtual try-on. It extracts a focus tunnel, a sequence of close-up crops around the clothing region, and zooms in on that region to preserve fine garment details. Kalman-filter smoothing of the crops and tunnel position embeddings injected into the attention layers maintain temporal coherence. The architecture couples a Main U-Net with a Ref U-Net, uses Temporal-Attention, and adds an Environment Encoder that supplies context from outside the tunnel. The model is trained in two stages, first on images and then on videos, and is reported to achieve state-of-the-art performance on challenging real-world data, surpassing prior GAN- and diffusion-based methods in both quality and temporal stability. The authors present it as a first attempt toward commercial-level video try-on across diverse clothing types and dynamic camera motions.
Abstract
Video try-on is a challenging task that previous work has not tackled well. The main obstacle lies in preserving the details of the clothing while simultaneously modeling coherent motion. To address these difficulties, we propose a diffusion-based framework named "Tunnel Try-on." The core idea is to excavate a "focus tunnel" in the input video that gives close-up shots of the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motion, we first use a Kalman filter to construct smooth crops in the focus tunnel, and then inject the position embedding of the tunnel into the attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder that extracts context information outside the tunnel as a supplementary cue. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
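The abstract does not spell out how the Kalman filter is configured, so the following is only a minimal sketch of the general idea of smoothing per-frame crop boxes. It assumes a random-walk state model applied independently to each box coordinate; the `(cx, cy, w, h)` box layout, the function names, and the noise variances are all hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical box layout: center x, center y, width, height.
Box = Tuple[float, float, float, float]


def kalman_smooth_1d(measurements: Sequence[float],
                     process_var: float = 1e-2,
                     meas_var: float = 1.0) -> List[float]:
    """Filter a 1-D sequence with a scalar random-walk Kalman filter."""
    if not measurements:
        return []
    x = float(measurements[0])  # initial state estimate
    p = meas_var                # initial estimate variance (assumed)
    out = [x]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, uncertainty grows.
        p = p + process_var
        # Update: blend the prediction with the new measurement.
        k = p / (p + meas_var)  # Kalman gain
        x = x + k * (z - x)
        p = (1.0 - k) * p
        out.append(x)
    return out


def smooth_boxes(boxes: Sequence[Box],
                 process_var: float = 1e-2,
                 meas_var: float = 1.0) -> List[Box]:
    """Smooth each coordinate of a per-frame box sequence independently."""
    if not boxes:
        return []
    channels = [
        kalman_smooth_1d([b[i] for b in boxes], process_var, meas_var)
        for i in range(4)
    ]
    return [tuple(c[t] for c in channels) for t in range(len(boxes))]


# Usage: a box whose center x jitters by +/-5 pixels from frame to frame.
raw = [(100.0 + (5.0 if t % 2 else -5.0), 200.0, 64.0, 128.0)
       for t in range(20)]
smoothed = smooth_boxes(raw)
```

A larger `meas_var` relative to `process_var` trusts the detector less and yields steadier crops at the cost of lagging behind fast motion; a constant-velocity state model would be a natural alternative when the subject or camera moves quickly.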
