Coarse-To-Fine Tensor Trains for Compact Visual Representations

Sebastian Loeschcke, Dan Wang, Christian Leth-Espensen, Serge Belongie, Michael J. Kastoryano, Sagie Benaim

TL;DR

This work tackles the challenge of learning compact, high-quality visual representations with tensor networks by introducing PuTT, a coarse-to-fine optimization framework for Quantized Tensor Trains (QTT). A global Matrix Product Operator (MPO) prolongation upscales learned TT representations across resolutions, with TT-SVD truncation to cap ranks and stabilize training, enabling scalable, memory-efficient models. PuTT demonstrates superior performance in compression, denoising, and learning from incomplete data across 2D/3D fitting and novel view synthesis, notably under high compression and when data is partially observed. The approach holds promise for large-scale neural radiance fields and dynamic visual representations by leveraging the logarithmic dimensionality advantages of QTTs and a robust coarse-to-fine training paradigm.
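
As a rough illustration of the logarithmic scaling mentioned above, assume (for this sketch only; the summary does not state it) that a single-channel $2^D \times 2^D$ image is stored as a QTT of length $D$ whose cores each have physical dimension 4 and ranks $R_i \leq R_{max}$. The parameter counts then compare as:

$$
\underbrace{4^{D}}_{\text{dense grid}}
\quad \text{versus} \quad
\underbrace{\sum_{i=1}^{D} 4\, R_{i-1} R_i \;\leq\; 4\, D\, R_{max}^{2}}_{\text{QTT}},
\qquad R_0 = R_D = 1 .
$$

For example, with $D = 10$ (a $1024 \times 1024$ image) and $R_{max} = 32$, the bound gives $4 \cdot 10 \cdot 32^2 = 40{,}960$ parameters, against $4^{10} = 1{,}048{,}576$ grid entries. The QTT cost grows linearly in $D$, that is, logarithmically in the number of pixels.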

Abstract

The ability to learn compact, high-quality, and easy-to-optimize representations for visual data is paramount to many applications such as novel view synthesis and 3D reconstruction. Recent work has shown substantial success in using tensor networks to design such compact and high-quality representations. However, the ability to optimize tensor-based representations, and in particular, the highly compact tensor train representation, is still lacking. This has prevented practitioners from deploying the full potential of tensor networks for visual data. To this end, we propose 'Prolongation Upsampling Tensor Train (PuTT)', a novel method for learning tensor train representations in a coarse-to-fine manner. Our method involves the prolonging or 'upsampling' of a learned tensor train representation, creating a sequence of 'coarse-to-fine' tensor trains that are incrementally refined. We evaluate our representation along three axes: (1) compression, (2) denoising capability, and (3) image completion capability. To assess these axes, we consider the tasks of image fitting, 3D fitting, and novel view synthesis, where our method shows improved performance compared to state-of-the-art tensor-based methods. For full results, see our project webpage: https://sebulo.github.io/PuTT_website/

Paper Structure (43 sections, 11 equations, 26 figures, 10 tables)


Figures (26)

  • Figure 1: Training one level of hierarchy. Initially, a batch $B$ serves two purposes: (1) it is used to sample data points $Y_B$ from the target input, and (2) it is transformed into QTT indices $\hat{B}$. Subsequently, the corresponding values $\hat{Y}_{\hat{B}}$ are sampled from the QTT. The reconstruction loss between $Y_B$ and $\hat{Y}_{\hat{B}}$ is computed and backpropagated.
  • Figure 2: Illustration of the QTT upsampling process. Beginning with a QTT $T_D$ of length $D$, upsampling is achieved through the prolongation MPO $\mathcal{P}$. This involves connecting the $D+1$ cores of $\mathcal{P}$ to the corresponding cores of $T_D$ and then contracting their shared indices. The result is a new QTT, $T_{D+1}$, of length $D+1$, whose ranks are increased to $R_i\hat{R}_i$. To manage this rank growth, TT-SVD (Oseledets) is employed for rank truncation. This process yields $T_{D+1}$ with controlled ranks $\tilde{R}_i \leq R_{max}$.
  • Figure 3: Full coarse-to-fine learning, integrating the 'Train' and 'Upsample' phases of Fig. 1 and Fig. 2. We start with input $I_D$, which is downscaled to $I_{D-l}$ (resolution $2^{D-l}\times 2^{D-l}$). A QTT is then randomly initialized for $I_{D-l}$. After training this QTT (as per Fig. 1) up to iteration $i_1$, we proceed with upsampling (as illustrated in Fig. 2), producing $T_{D-l+1}$ of length $D-l+1$. This represents a grid at resolution $2^{D-l+1}\times 2^{D-l+1}$ and requires sampling from a newly downsampled target $I_{D-l+1}$ of the same resolution.
  • Figure 4: Novel view synthesis with PuTT. This illustration shows the process using two QTTs, $T_\sigma$ and $T_c$, for computing volume density ($\sigma$) and view-dependent color ($c$) through differentiable ray marching. For each 3D location $x=(x,y,z)$ and viewing direction $d$, $x$ is converted to QTT indices. These indices query the density voxel grid $T_\sigma$ and the color voxel grid $T_c$. Trilinear interpolation is used to obtain continuous $\sigma$ values and appearance feature vectors $\hat{c}$. These vectors are processed by a shading module to generate raw color values $c$, which, along with $\sigma$, are used in differentiable volumetric rendering. The rendering loss is computed by comparing the generated color values against ground truth (g.t.) values.
  • Figure 5: Qualitative comparison to baselines on 16k images.
  • ...and 21 more figures
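
The sampling step of Figure 1 and the rank truncation of Figure 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single-channel $2^D \times 2^D$ image, an interleaved bit ordering in which the $k$-th QTT index is $2 r_k + c_k$ (row bit and column bit, coarse to fine), and it builds the QTT directly with TT-SVD instead of training it or applying the prolongation MPO. The function names `to_qtt_indices`, `quantize`, `tt_svd`, and `sample` are hypothetical.

```python
import numpy as np

def to_qtt_indices(rows, cols, D):
    """Map pixel coordinates of a 2^D x 2^D grid to D base-4 digits (coarse to fine)."""
    shifts = np.arange(D - 1, -1, -1)              # most significant bit first
    rbits = (rows[:, None] >> shifts) & 1          # (batch, D) row bits
    cbits = (cols[:, None] >> shifts) & 1          # (batch, D) column bits
    return 2 * rbits + cbits                       # assumed ordering: 2*r_k + c_k

def quantize(image, D):
    """Reshape a 2^D x 2^D image into an order-D tensor with mode size 4."""
    t = image.reshape((2,) * (2 * D))              # axes: r_1..r_D, c_1..c_D
    order = [ax for k in range(D) for ax in (k, D + k)]   # r_1, c_1, r_2, c_2, ...
    return t.transpose(order).reshape((4,) * D)

def tt_svd(tensor, max_rank):
    """Sequential truncated SVDs (TT-SVD) with every rank capped at max_rank."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))                  # truncate to the rank cap
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def sample(cores, idx):
    """Evaluate the TT at a batch of index tuples idx with shape (batch, D)."""
    out = cores[0][0, idx[:, 0], :]                # (batch, r_1)
    for k in range(1, len(cores)):
        # Select one slice per batch element, then contract the shared rank index.
        out = np.einsum("br,rbs->bs", out, cores[k][:, idx[:, k], :])
    return out[:, 0]

# Usage: an 8 x 8 target (D = 3), a batch B of pixel coordinates, and an L2 loss.
D = 3
rng = np.random.default_rng(0)
target = rng.random((2**D, 2**D))
cores = tt_svd(quantize(target, D), max_rank=2)    # compressed QTT, ranks <= 2
rows = rng.integers(0, 2**D, size=16)
cols = rng.integers(0, 2**D, size=16)
y_true = target[rows, cols]                        # Y_B sampled from the target
y_pred = sample(cores, to_qtt_indices(rows, cols, D))   # values sampled from the QTT
loss = np.mean((y_true - y_pred) ** 2)             # reconstruction loss on the batch
```

In a training setting, the cores would be learnable parameters and `loss` would be backpropagated with an autodiff framework; here the loss only measures the truncation error. Raising `max_rank` to the full rank (4 for this $8 \times 8$ example) makes the sampled values match the target exactly.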