LIM: Large Interpolator Model for Dynamic Reconstruction

Remy Sabathier, Niloy J. Mitra, David Novotny

TL;DR

LIM tackles dynamic 4D asset reconstruction with a transformer-based Large Interpolator Model that interpolates implicit 3D representations between two keyframes at any continuous time. Built on a multi-view extension of the Large Reconstruction Model (LRM), LIM uses a novel causal consistency loss to enforce temporally coherent interpolations and enables time-resolved, uv-textured mesh tracking. The approach supports both multi-view and monocular inputs (the latter via diffusion-driven view synthesis) and demonstrates better interpolation quality, faster runtime, and more robust mesh tracking than baselines and ablations. This yields high-fidelity 4D reconstructions that are ready for existing production pipelines and downstream applications. Key contributions include the integration of canonical surface coordinates for mesh tracking and a causally consistent training objective that preserves temporal coherence across arbitrary interpolation times.

Abstract

Reconstructing dynamic assets from video data is central to many computer vision and graphics tasks. Existing 4D reconstruction approaches are limited by category-specific models or slow optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times $t_0$ and $t_1$, LIM produces a deformed shape at any continuous time $t\in[t_0,t_1]$, delivering high-quality interpolated frames in seconds. Furthermore, LIM allows explicit mesh tracking across time, producing a consistently uv-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FILM) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
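
To make the "direct triplane linear interpolation" baseline concrete, here is a minimal sketch that blends two triplane feature grids with the weight $\alpha = (t - t_0)/(t_1 - t_0)$. The nested-list layout (3 planes, then channels, rows, columns), the toy values, and the names `lerp_triplanes` and `_blend` are assumptions made for illustration; the page does not specify the triplane layout used by the authors. This sketches the baseline that LIM is compared against, not LIM itself.

```python
def _blend(a, b, alpha):
    """Recursively blend two equally shaped nested lists of numbers."""
    if isinstance(a, (int, float)):
        return (1.0 - alpha) * a + alpha * b
    if len(a) != len(b):
        raise ValueError("triplanes must have identical shapes")
    return [_blend(x, y, alpha) for x, y in zip(a, b)]


def lerp_triplanes(tri_0, tri_1, t, t0=0.0, t1=1.0):
    """Linearly interpolate two triplanes at a continuous time t in [t0, t1].

    Hypothetical helper: the shapes and names are illustrative assumptions.
    """
    if not t0 < t1:
        raise ValueError("t0 must be strictly smaller than t1")
    if not t0 <= t <= t1:
        raise ValueError("t must lie in [t0, t1]")
    alpha = (t - t0) / (t1 - t0)  # normalized interpolation weight in [0, 1]
    return _blend(tri_0, tri_1, alpha)


# Toy triplanes: 3 planes x 1 channel x 2 x 2 (assumed layout).
tri_a = [[[[0.0, 1.0], [2.0, 3.0]]] for _ in range(3)]
tri_b = [[[[4.0, 5.0], [6.0, 7.0]]] for _ in range(3)]
tri_mid = lerp_triplanes(tri_a, tri_b, t=0.5)
```

Because each feature is blended independently of the scene geometry, a moving part is cross-faded between its two positions rather than displaced.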

Paper Structure

This paper contains 38 sections, 5 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Large Interpolator Model (LIM) outputs a 4D video reconstruction by interpolating 3D implicit representations between two consecutive keyframes at times $t=0$ and $t=1$, which can then be used to produce 3D-consistent RGB, depth, or decoded as tracked mesh sequences.
  • Figure 2: LIM framework. (Left) Given multi-view images at two timesteps $k$ and $k+1$, LIM interpolates any intermediate 3D representation at $k+\alpha$, $\alpha \in [0,1]$. It achieves this notably via cross-attention with the latest intermediate features of LRM on keyframe $k$. In practice, our LIM architecture has 6 blocks and LRM has 12 blocks. (Right) Block structure of LRM and LIM. We apply layer normalization before each module in the blocks.
  • Figure 3: An LRM conditioned on a single view (TripoSR; Tochilkin et al., 2024) is sensitive to small changes in the input image, which gives inconsistent results from one video frame to the next. The multi-view LRM prevents this instability. For each model, the left shows an input view and the right shows two target views. Each row is a different timestep.
  • Figure 4: Interpolation results comparing (i) linear interpolation in triplane space, which fails on dynamic parts; (ii) the image-based interpolator FILM (Reda et al., 2022), yielding view-inconsistent frame interpolations that lead to defective reconstructions (ghosting around dynamic parts; for example, the tip of the elephant's trunk or the fish's tail); and (iii) our LIM-based interpolation, which yields the most plausible results.
  • Figure 5: Mesh tracking results. Given two implicit 3D representations, LIM can interpolate densely in time and hence can track a source mesh to produce a deforming mesh sequence. For each scene, we show (top to bottom) the RGB rendering of the tracked mesh, depth, and canonical-coordinate interpolation. See the supplemental video on the project page: https://remysabathier.github.io/lim.github.io/.
  • ...and 4 more figures
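
The captions for Figures 2 and 5 describe querying an interpolated representation at $k+\alpha$ with $\alpha \in [0,1]$, densely in time. As a minimal sketch of the bookkeeping this implies, the code below splits a continuous query time into a keyframe index $k$ and a weight $\alpha$, and builds a uniform query schedule. It assumes keyframes sit at integer times $0, 1, \dots, K-1$; the names `split_time` and `dense_schedule` are hypothetical and do not come from the paper or its code.

```python
import math


def split_time(t, num_keyframes):
    """Map a continuous time t in [0, num_keyframes - 1] to (k, alpha), t = k + alpha.

    Assumes keyframes are placed at integer times (an illustrative assumption).
    """
    if num_keyframes < 2:
        raise ValueError("need at least two keyframes to interpolate")
    last = num_keyframes - 1
    if not 0.0 <= t <= last:
        raise ValueError("t is outside the keyframe range")
    # Clamp so the final keyframe is expressed as (last - 1, alpha = 1.0).
    k = min(int(math.floor(t)), last - 1)
    alpha = t - k
    return k, alpha


def dense_schedule(num_keyframes, steps_per_interval):
    """Uniformly spaced (k, alpha) queries covering every keyframe interval."""
    total = (num_keyframes - 1) * steps_per_interval
    return [split_time(i / steps_per_interval, num_keyframes)
            for i in range(total + 1)]


schedule = dense_schedule(num_keyframes=3, steps_per_interval=2)
```

Each `(k, alpha)` pair would then select keyframes `k` and `k + 1` as inputs and `alpha` as the interpolation time for whichever interpolator is used.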