Toon3D: Seeing Cartoons from New Perspectives

Ethan Weber, Riley Peterlinz, Rohan Mathur, Frederik Warburg, Alexei A. Efros, Angjoo Kanazawa

TL;DR

Toon3D tackles 3D reconstruction from geometrically inconsistent cartoon and anime imagery by deforming the input views and using monocular depth priors to recover coherent camera poses and geometry. It introduces a deformable, piecewise-rigid optimization that aligns sparse 2D correspondences in 3D while warping images to satisfy a perspective camera model, regularized by an ARAP-like term and a depth constraint. A new Toon3D Dataset and a web-based Toon3D Labeler enable human-in-the-loop annotation for 12 scenes, and the resulting reconstructions support novel-view synthesis via Gaussian Splatting. Empirically, Toon3D yields more reliable poses and 3D geometry than COLMAP or DUSt3R on cartoons, and it also reconstructs paintings and sparse Airbnb views, which suggests practical utility for art-centric 3D understanding and novel-view visualization.

Abstract

We recover the underlying 3D structure from images of cartoons and anime depicting the same scene. This is an interesting problem domain because images in creative media are often depicted without explicit geometric consistency for storytelling and creative expression; they are only 3D in a qualitative sense. While humans can easily perceive the underlying 3D scene from these images, existing Structure-from-Motion (SfM) methods that assume 3D consistency fail catastrophically. We present Toon3D for reconstructing geometrically inconsistent images. Our key insight is to deform the input images while recovering camera poses and scene geometry, effectively explaining away geometrical inconsistencies to achieve consistency. This process is guided by the structure inferred from monocular depth predictions. We curate a dataset with multi-view imagery from cartoons and anime that we annotate with reliable sparse correspondences using our user-friendly annotation tool. Our recovered point clouds can be plugged into novel-view synthesis methods to experience cartoons from viewpoints never drawn before. We evaluate against classical and recent learning-based SfM methods, where Toon3D is able to obtain more reliable camera poses and scene geometry.
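
As a rough illustration of the idea described in the abstract, the sketch below backprojects labeled pixels through a pinhole camera using a monocular depth value, then measures how far apart corresponding 3D points from two views land. The `Camera` class, the function names, and the per-image `scale` and `shift` applied to depth are assumptions made for this example and are not the paper's implementation; camera rotation and image deformation are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Camera:
    """Hypothetical pinhole camera: focal length, principal point, translation.

    Rotation is omitted on purpose to keep the sketch small.
    """
    focal: float
    cx: float
    cy: float
    translation: Vec3 = (0.0, 0.0, 0.0)


def backproject(u: float, v: float, depth: float, cam: Camera,
                scale: float = 1.0, shift: float = 0.0) -> Vec3:
    """Lift pixel (u, v) to 3D using an affinely corrected monocular depth."""
    z = scale * depth + shift              # assumed per-image depth correction
    x = (u - cam.cx) / cam.focal * z       # pinhole model, x axis
    y = (v - cam.cy) / cam.focal * z       # pinhole model, y axis
    tx, ty, tz = cam.translation
    return (x + tx, y + ty, z + tz)


def alignment_loss(points_a: Sequence[Vec3], points_b: Sequence[Vec3]) -> float:
    """Mean squared 3D distance between corresponding points of two views."""
    total = 0.0
    for pa, pb in zip(points_a, points_b):
        total += sum((a - b) ** 2 for a, b in zip(pa, pb))
    return total / len(points_a)


# One labeled correspondence seen from two views.
cam_a = Camera(focal=100.0, cx=50.0, cy=50.0)
cam_b = Camera(focal=100.0, cx=50.0, cy=50.0, translation=(1.0, 0.0, 0.0))

point_a = backproject(150.0, 50.0, 2.0, cam_a)
unaligned_b = backproject(100.0, 50.0, 1.0, cam_b)            # raw depth
aligned_b = backproject(100.0, 50.0, 1.0, cam_b, scale=2.0)   # corrected depth

print(alignment_loss([point_a], [unaligned_b]))
print(alignment_loss([point_a], [aligned_b]))
```

In this toy setup the correspondence only agrees in 3D once the second view's depth is rescaled, which is the kind of quantity an optimizer would adjust alongside the camera parameters.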

Paper Structure (21 sections, 8 equations, 13 figures, 2 tables)


Figures (13)

  • Figure 1: Reconstructing a 3D scene from 3D inconsistent images. Cartoons and animations often depict scenes that are not geometrically consistent by design (left), making it challenging for classical Structure-from-Motion (SfM) techniques to reconstruct these scenes as they assume 3D consistency (middle). However, humans can easily perceive the underlying 3D scene from these images. We introduce Toon3D, which addresses these challenges by deforming images during reconstruction to account for geometric inconsistencies and leveraging monocular depth priors. The middle column illustrates how Bundle Adjustment fails, even with manually labeled correspondences, resulting in scattered Gaussian splats (top) and misaligned camera reconstructions visualized by backprojected monodepths (bottom). The right column shows our Toon3D results, with more coherent Gaussian splats (top) and well-structured point clouds and camera views (bottom), demonstrating significantly improved 3D consistency. Our project page is https://toon3d.studio/.
  • Figure 2: Toon3D overview. Our framework consists of labeling images with our interactive Toon3D Labeler tool, recovering camera poses and aligning a dense point cloud, and visualizing the dense reconstruction with Gaussians to create an immersive visual experience.
  • Figure 3: Toon3D alignment. The camera alignment objective aligns the point clouds while optimizing for camera intrinsics and extrinsics. Deformation alignment deforms the images to obey a perspective camera model. In practice, our method uses all the losses described here to obtain an aligned point cloud and posed images.
  • Figure 4: 3D alignment ablations. Row 1 (Rick and Morty House) shows regularization's impact on scene shaping. Optimized shift and scale parameters can adjust point clouds to better align at correspondences. This is evident as the starred points converge. The aspect regularization keeps the optimized image close to its original aspect ratio. Row 2 (BoJack Horseman House) explores the effects of different warp regularizers ($\mathcal{L}_{ARAP_{2D}}$ and $\mathcal{L}_{z}$) on scene warping. Without any regularization, warping distorts scene geometry. ARAP alone results in poor 3D warps due to inaccurate depth. $z$ regularization alone limits scene movement, maintaining rigid structures close to the original depth map. Using both strikes a good balance between correctly positioning geometry and preserving structural integrity.
  • Figure 5: 3D reconstructions of cartoons. Off-the-shelf methods like COLMAP fail completely. The state-of-the-art learning-based method DUSt3R [wang2024dust3r] also fails catastrophically on many scenes, even with labeled correspondences (left). Our method (middle) recovers reliable cameras and a plausible point cloud, which can be visualized with Gaussians for a more immersive experience. For the SpongeBob scene (top), we label point correspondences between walls to reconstruct two rooms together. Notably, our method works with different depth predictors. From top to bottom, we show results with MoGe [wang2024moge], Depth Anything V2 [yang2024depth], and Marigold [ke2023repurposing_marigold].
  • ...and 8 more figures
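
The Figure 4 caption contrasts two warp regularizers, $\mathcal{L}_{ARAP_{2D}}$ and $\mathcal{L}_{z}$, without giving their formulas here. The sketch below shows one plausible shape for such terms under stated assumptions: a simplified rigidity penalty that preserves mesh edge lengths in 2D (a stand-in for a full ARAP energy, which would also fit per-cell rotations) and a depth penalty that keeps warped depths near the predicted ones. All function names and weights are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point2 = Tuple[float, float]
Edge = Tuple[int, int]


def edge_length(p: Point2, q: Point2) -> float:
    """Euclidean distance between two 2D mesh vertices."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def rigidity_loss_2d(rest: Sequence[Point2], warped: Sequence[Point2],
                     edges: Sequence[Edge]) -> float:
    """Simplified ARAP-like term: penalize changes in mesh edge length."""
    total = 0.0
    for i, j in edges:
        diff = edge_length(warped[i], warped[j]) - edge_length(rest[i], rest[j])
        total += diff * diff
    return total / len(edges)


def depth_loss(z_pred: Sequence[float], z_warped: Sequence[float]) -> float:
    """z-style term: keep warped depths close to the predicted depths."""
    return sum((a - b) ** 2 for a, b in zip(z_pred, z_warped)) / len(z_pred)


def warp_regularizer(rest, warped, edges, z_pred, z_warped,
                     w_rigid: float = 1.0, w_z: float = 1.0) -> float:
    """Weighted sum of both terms; the weights are illustrative only."""
    return (w_rigid * rigidity_loss_2d(rest, warped, edges)
            + w_z * depth_loss(z_pred, z_warped))


# A single triangle: translating it is free, stretching it is penalized.
rest = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
edges = [(0, 1), (1, 2), (2, 0)]
translated = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5)]
stretched = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]

print(rigidity_loss_2d(rest, translated, edges))
print(rigidity_loss_2d(rest, stretched, edges))
print(warp_regularizer(rest, stretched, edges, [1.0, 2.0], [1.0, 3.0]))
```

Setting `w_rigid` or `w_z` to zero mimics the single-regularizer rows of the ablation: with neither term the warp is unconstrained, while using both trades off moving geometry against preserving structure.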