ToonCrafter: Generative Cartoon Interpolation

Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, Tien-Tsin Wong

TL;DR

ToonCrafter tackles the challenge of cartoon interpolation by moving from traditional correspondence-based methods to a generative interpolation paradigm that leverages live-action video priors. The approach hinges on three core innovations: toon rectification learning to adapt motion priors to the cartoon domain, a dual-reference-based 3D decoder with a hybrid-attention-residual mechanism to recover detail lost in latent spaces, and a frame-independent sketch encoder that provides flexible, user-controlled guidance. A large-scale cartoon dataset is created to enable domain adaptation, and extensive ablations validate the necessity and effectiveness of each component. Experimental results show substantial improvements over state-of-the-art cartoon interpolation methods in motion realism, temporal coherence, and robustness to dis-occlusion, with additional versatility demonstrated through sketch-based and colorization applications.

Abstract

We introduce ToonCrafter, a novel approach that transcends traditional correspondence-based cartoon video interpolation, paving the way for generative interpolation. Traditional methods, which implicitly assume linear motion and the absence of complicated phenomena like dis-occlusion, often struggle with the exaggerated non-linear and large motions with occlusion commonly found in cartoons, resulting in implausible or even failed interpolation results. To overcome these limitations, we explore the potential of adapting live-action video priors to better suit cartoon interpolation within a generative framework. ToonCrafter effectively addresses the challenges faced when applying live-action video motion priors to generative cartoon interpolation. First, we design a toon rectification learning strategy that seamlessly adapts live-action video priors to the cartoon domain, resolving the domain gap and content leakage issues. Next, we introduce a dual-reference-based 3D decoder to compensate for lost details due to the highly compressed latent prior spaces, ensuring the preservation of fine details in interpolation results. Finally, we design a flexible sketch encoder that empowers users with interactive control over the interpolation results. Experimental results demonstrate that our proposed method not only produces visually convincing and more natural dynamics, but also effectively handles dis-occlusion. The comparative evaluation demonstrates the notable superiority of our approach over existing competitors.
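
To make the dual-reference decoder idea more concrete, the following is a minimal sketch of the two injection mechanisms named in the Figure 3 caption: cross-attention from decoder features to features of the two input frames, and residual addition of input-frame features to the first and last frames. It is not the authors' implementation. The function names (`cross_attend`, `add_endpoint_residual`), the use of plain Python lists as features, the absence of learned projections, and the residual connection inside the attention step are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, references):
    """Let every decoder token attend over all reference tokens.

    queries:    token vectors of one decoded frame (list of lists).
    references: token vectors taken from both input frames, concatenated.
    Assumption: no learned query/key/value projections; the attended
    detail is added back onto the query as a residual.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, r)) / math.sqrt(dim)
            for r in references
        ]
        weights = softmax(scores)
        attended = [
            sum(w * r[d] for w, r in zip(weights, references))
            for d in range(dim)
        ]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def add_endpoint_residual(frame_feats, ref_first, ref_last):
    """Add reference features to the first and last frame only.

    frame_feats: one flat feature vector per frame (list of lists).
    Intermediate frames are returned unchanged; the input is not mutated.
    """
    out = [list(f) for f in frame_feats]
    out[0] = [a + b for a, b in zip(out[0], ref_first)]
    out[-1] = [a + b for a, b in zip(out[-1], ref_last)]
    return out
```

The residual path only touches the two endpoint frames because those are the frames that correspond directly to the input images, whereas attention lets every intermediate frame borrow detail from either input.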

Paper Structure (18 sections, 6 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1:
  • Figure 2: Overview of the proposed ToonCrafter. Given two cartoon images $\mathbf{x}^1$ and $\mathbf{x}^L$, ToonCrafter leverages the image-to-video generative diffusion model as a generator to generate intermediate frame latents $\mathbf{z}_0$. These latents are subsequently decoded into pixel space through the proposed detail-injected decoder with $\mathbf{x}^1$ and $\mathbf{x}^L$ as detail guidance. Optionally, the interpolation can be controlled with sparse sketch guidance.
  • Figure 3: Illustration of the detail-injected 3D decoder. Given frame latents $\mathbf{z}$ as input, we inject the intermediate features of the input images $\mathbf{x}^1$ and $\mathbf{x}^L$ from encoder $\mathcal{E}$ in two ways: through cross-attention in shallow layers, and through residual learning (i.e., addition to the features of the 1st and $L$-th frames) in deep layers.
  • Figure 4: Examples of different patterns of sketch-guidance: (top) bisection ($n$=1) and (bottom) random position.
  • Figure 5: Visual comparison of the interpolation frames generated by variants with different rectification strategies.
  • ...and 5 more figures
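
The two sketch-guidance patterns named in the Figure 4 caption can be illustrated with a short Python sketch that picks which intermediate frames receive a guiding sketch. This is an illustration under stated assumptions, not the authors' code: the caption does not define $n$, so it is read here as a recursion depth, with each level selecting the midpoint of every current segment. The function names, the 1-based frame indexing, and the 16-frame example are hypothetical.

```python
import random


def bisection_positions(first, last, depth):
    """Pick midpoints of the segment (first, last), recursing `depth` levels.

    Assumption: depth plays the role of n in the caption, so depth=1
    selects only the middle frame of the whole clip.
    """
    if depth <= 0 or last - first < 2:
        # No levels left, or no interior frame between the endpoints.
        return []
    mid = (first + last) // 2
    left = bisection_positions(first, mid, depth - 1)
    right = bisection_positions(mid, last, depth - 1)
    return sorted(set(left + [mid] + right))


def random_positions(first, last, count, rng):
    """Pick `count` distinct interior frames uniformly at random."""
    interior = list(range(first + 1, last))
    return sorted(rng.sample(interior, min(count, len(interior))))


if __name__ == "__main__":
    print(bisection_positions(1, 16, 1))   # middle frame only
    print(bisection_positions(1, 16, 2))   # midpoint plus quarter points
    print(random_positions(1, 16, 3, random.Random(0)))
```

Both helpers return only interior frame indices, since the first and last frames are already given as input images.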