Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Zhengze Xu; Mengting Chen; Zhao Wang; Linyu Xing; Zhonghua Zhai; Nong Sang; Jinsong Lan; Shuai Xiao; Changxin Gao

TL;DR

Tunnel Try-on tackles video virtual try-on with a diffusion-based framework built around a focus tunnel: a sequence of close-up crops around the clothing region that helps preserve fine garment details. Kalman-filter smoothing of the crops and tunnel embeddings injected into the attention layers improve temporal coherence. The architecture pairs a Main U-Net with a Ref U-Net that encodes the clothing image, adds temporal attention for video, and uses an environment encoder to supply global context from outside the tunnel. Training proceeds in two stages, first on images and then on videos. The authors report state-of-the-art performance on challenging real-world data, surpassing prior GAN- and diffusion-based methods in both quality and temporal stability, and position the method as a step toward commercial-level video try-on across diverse clothing types and camera motions.

Abstract

Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
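
The tunnel smoothing step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes per-frame binary garment masks are available, represents each crop as (center x, center y, width, height), and uses a simple random-walk Kalman filter applied independently to each coordinate. The function names `boxes_from_masks` and `kalman_smooth`, the padding ratio, and the noise variances are all hypothetical choices made for illustration.

```python
import numpy as np


def boxes_from_masks(masks, pad=0.1):
    """Return one padded box (cx, cy, w, h) per frame around the nonzero mask pixels.

    Assumes every mask contains at least one nonzero pixel.
    """
    boxes = []
    for m in masks:
        ys, xs = np.nonzero(m)
        x0, x1 = xs.min(), xs.max() + 1
        y0, y1 = ys.min(), ys.max() + 1
        w = (x1 - x0) * (1 + 2 * pad)
        h = (y1 - y0) * (1 + 2 * pad)
        boxes.append([(x0 + x1) / 2, (y0 + y1) / 2, w, h])
    return np.asarray(boxes, dtype=float)


def kalman_smooth(boxes, process_var=1e-2, meas_var=1.0):
    """Filter each box coordinate with a random-walk (constant-position) Kalman filter.

    process_var and meas_var are illustrative values, not taken from the paper.
    """
    x = boxes[0].copy()      # state estimate, one scalar filter per coordinate
    p = np.ones_like(x)      # estimate variance per coordinate
    out = [x.copy()]
    for z in boxes[1:]:
        p = p + process_var          # predict: state unchanged, uncertainty grows
        k = p / (p + meas_var)       # Kalman gain
        x = x + k * (z - x)          # update with the measured box
        p = (1 - k) * p
        out.append(x.copy())
    return np.stack(out)
```

A typical use would be `smooth = kalman_smooth(boxes_from_masks(masks))`, after which each frame is cropped with its smoothed box and resized to the model resolution. Larger `meas_var` relative to `process_var` gives steadier crops at the cost of lagging behind fast motion.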

Paper Structure (28 sections, 2 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Image Visual Try-on
Video Visual Try-on
Image Animation
Method
Preliminaries
Overall Architecture
Image try-on baseline.
Adaption for videos.
Novel designs of Tunnel Try-on.
Focus Tunnel Extraction
Focus Tunnel Enhancement
Tunnel smoothing.
Tunnel embedding.
...and 13 more sections

Figures (7)

Figure 1: Generated results of Tunnel Try-on. Our model achieves state-of-the-art performance in the video try-on task. It can not only handle complex clothing and backgrounds but also adapt to different types of human movements in the video (first and second rows) and camera angle changes (third row).
Figure 2: The overview of Tunnel Try-on. Given an input video and a clothing image, we first extract a focus tunnel to zoom in on the region around the garments to better preserve the details. The zoomed region is represented by a sequence of tensors consisting of the background latent, latent noise, and the garment mask, which are concatenated and fed into the Main U-Net. At the same time, we use a Ref U-Net and a CLIP Encoder to extract the representations of the clothing image. These clothing representations are then added to the Main U-Net using ref-attention. Moreover, human pose information is added into the latent feature to assist in generation. The tunnel embedding is also integrated into temporal attention to generate more consistent motions, and an environment encoder is developed to extract the global context as additional guidance.
Figure 3: Qualitative comparison with existing alternatives on the VVT dataset. The clothing and target person are shown in (a). The results of (b) FW-GAN, (c) PBAFN, (d) ClothFormer, (e) StableVITON, and (f) Tunnel Try-on are presented, respectively.
Figure 4: Qualitative results of Tunnel Try-on on our dataset. We present the try-on results of pants and skirts, as well as cross-category try-on results.
Figure 5: Qualitative ablations for the focus tunnel. This zoom-in strategy brings notable improvements for preserving the fine details of the clothing.
...and 2 more figures
