ToonCrafter: Generative Cartoon Interpolation
Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, Tien-Tsin Wong
TL;DR
ToonCrafter tackles the challenge of cartoon interpolation by moving from traditional correspondence-based methods to a generative interpolation paradigm that leverages live-action video priors. The approach hinges on three core innovations: toon rectification learning to adapt motion priors to the cartoon domain, a dual-reference-based 3D decoder with a hybrid-attention-residual mechanism to recover detail lost in latent spaces, and a frame-independent sketch encoder that provides flexible, user-controlled guidance. A large-scale cartoon dataset is created to enable domain adaptation, and extensive ablations validate the necessity and effectiveness of each component. Experimental results show substantial improvements over state-of-the-art cartoon interpolation methods in motion realism, temporal coherence, and robustness to dis-occlusion, with additional versatility demonstrated through sketch-based and colorization applications.
Abstract
We introduce ToonCrafter, a novel approach that transcends traditional correspondence-based cartoon video interpolation, paving the way for generative interpolation. Traditional methods, which implicitly assume linear motion and the absence of complicated phenomena like dis-occlusion, often struggle with the exaggerated, non-linear, and large motions with occlusion commonly found in cartoons, resulting in implausible or even failed interpolation results. To overcome these limitations, we explore the potential of adapting live-action video priors to better suit cartoon interpolation within a generative framework. ToonCrafter effectively addresses the challenges faced when applying live-action video motion priors to generative cartoon interpolation. First, we design a toon rectification learning strategy that seamlessly adapts live-action video priors to the cartoon domain, resolving the domain gap and content leakage issues. Next, we introduce a dual-reference-based 3D decoder to compensate for details lost in the highly compressed latent prior spaces, ensuring that fine details are preserved in the interpolation results. Finally, we design a flexible sketch encoder that gives users interactive control over the interpolation results. Experimental results demonstrate that our proposed method not only produces visually convincing and more natural dynamics, but also effectively handles dis-occlusion. The comparative evaluation demonstrates the notable superiority of our approach over existing competitors.
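To make the linear-motion assumption concrete, the following is a minimal illustrative sketch, not part of ToonCrafter. It assumes a hypothetical toy scene in which a single point (say, a character's hand) swings along a quarter circle between two keyframes, and compares the midpoint predicted by linearly interpolating the correspondence with the midpoint on the actual arc. The function names and the scene are assumptions introduced only for this illustration.

```python
import math

def linear_midpoint(p0, p1):
    """Middle-frame position under the linear-motion assumption."""
    return ((p0[0] + p1[0]) / 2, (p0[1] + p1[1]) / 2)

def arc_midpoint(radius, theta0, theta1):
    """Middle-frame position for a point that actually swings along a circular arc."""
    theta = (theta0 + theta1) / 2
    return (radius * math.cos(theta), radius * math.sin(theta))

# Hypothetical toy scene: a quarter-circle swing of radius 1 between two keyframes.
radius = 1.0
theta0, theta1 = 0.0, math.pi / 2
p0 = (radius * math.cos(theta0), radius * math.sin(theta0))
p1 = (radius * math.cos(theta1), radius * math.sin(theta1))

linear = linear_midpoint(p0, p1)              # cuts straight across the chord
true = arc_midpoint(radius, theta0, theta1)   # stays on the arc
error = math.dist(linear, true)               # gap between the two predictions

print(f"linear: {linear}, arc: {true}, error: {error:.4f}")
```

In this toy case the linearly interpolated point lands on the chord rather than the arc, so it falls short of the true position by 1 - sqrt(2)/2 of the radius. The gap grows with the size and curvature of the motion, which is the regime the abstract describes as typical for cartoons.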
